[2025-08-20 08:10:54,076][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch.
[2025-08-20 08:10:55,185][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found).
[2025-08-20 08:10:55,192][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch.
[2025-08-20 08:10:56,011][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found).
[2025-08-20 08:12:42,003][__main__][INFO] - Starting iteration 0.
[2025-08-20 08:13:04,747][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 66.42587245075853%,
[2025-08-20 08:13:04,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 66.42587245075853%,
[2025-08-20 08:13:04,754][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 66.42587245075853%,
[2025-08-20 08:13:07,155][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 66.42587245075853%,
[2025-08-20 08:13:07,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 66.42587245075853%,
[2025-08-20 08:13:07,163][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 66.42587245075853%,
[2025-08-20 08:13:07,165][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:13:07,166][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:13:07,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:08,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:09,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:10,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:11,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:12,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:12,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:13,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:14,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:15,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:15,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:16,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:17,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:18,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:19,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:20,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:21,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:22,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:22,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:23,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:24,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:25,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:25,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:26,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:27,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:28,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:29,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:29,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:30,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:31,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:32,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:33,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:13:34,520][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 0.6203231811523438 GB, ΔVRAM Reserved: 0.9375 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:13:35,498][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:13:35,500][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:13:36,932][__main__][INFO] - Iteration 1 took 54s (36.99% Gen, 63.01% Train). Generation: 20s, Training: 34s. Estimated remaining time: 15h 12m 23s. Estimated total time: 15h 15m 29s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 44s.
[2025-08-20 08:13:36,934][__main__][INFO] - Starting iteration 1.
[2025-08-20 08:13:59,997][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:13:59,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:14:00,004][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:14:02,451][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:14:02,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:14:02,458][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:14:02,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:14:02,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:14:02,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:03,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:04,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:05,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:05,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:06,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:07,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:08,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:09,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:09,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:11,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:11,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:12,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:13,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:14,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:15,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:15,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:16,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:17,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:18,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:19,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:19,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:20,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:21,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:22,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:23,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:23,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:24,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:25,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:26,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:26,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:27,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:14:29,408][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:14:30,876][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:14:30,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:14:32,233][__main__][INFO] - Iteration 2 took 55s (37.34% Gen, 62.66% Train). Generation: 20s, Training: 34s. Estimated remaining time: 15h 17m 36s. Estimated total time: 15h 21m 37s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 9s, 500 more iterations: 7h 40m 48s.
[2025-08-20 08:14:32,234][__main__][INFO] - Starting iteration 2.
[2025-08-20 08:14:58,832][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:14:58,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:14:58,840][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:01,289][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:01,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:01,297][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:01,299][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:15:01,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:15:01,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:02,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:03,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:03,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:04,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:05,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:06,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:07,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:07,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:08,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:09,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:10,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:11,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:11,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:12,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:13,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:14,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:15,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:15,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:16,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:17,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:18,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:19,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:19,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:20,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:21,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:22,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:23,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:24,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:25,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:25,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:26,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:28,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:15:29,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:15:29,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:15:30,690][__main__][INFO] - Iteration 3 took 58s (41.34% Gen, 58.66% Train). Generation: 24s, Training: 34s. Estimated remaining time: 16h 9m 16s. Estimated total time: 16h 14m 15s. Time estimates for 10 more iterations: 9m 44s, 100 more iterations: 1h 37m 25s, 500 more iterations: 8h 7m 7s.
[2025-08-20 08:15:30,692][__main__][INFO] - Starting iteration 3.
[2025-08-20 08:15:53,773][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:53,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:53,781][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:56,231][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:56,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:56,239][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:15:56,241][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:15:56,241][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:15:56,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:57,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:58,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:58,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:15:59,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:00,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:01,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:02,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:02,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:03,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:04,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:05,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:06,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:06,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:07,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:08,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:09,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:10,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:10,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:11,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:12,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:13,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:14,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:15,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:16,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:16,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:17,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:18,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:19,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:20,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:20,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:21,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:23,288][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:16:24,242][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:16:24,244][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:16:25,563][__main__][INFO] - Iteration 4 took 54s (37.61% Gen, 62.39% Train). Generation: 20s, Training: 34s. Estimated remaining time: 15h 8m 35s. Estimated total time: 15h 14m 30s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 15s.
[2025-08-20 08:16:25,564][__main__][INFO] - Starting iteration 4.
[2025-08-20 08:16:49,103][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:16:49,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:16:49,111][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:16:51,569][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:16:51,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:16:51,576][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:16:51,578][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:16:51,579][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:16:51,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:52,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:53,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:54,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:55,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:55,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:56,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:57,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:58,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:59,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:16:59,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:00,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:01,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:02,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:02,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:03,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:04,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:05,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:06,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:06,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:07,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:08,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:09,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:10,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:11,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:12,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:12,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:13,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:14,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:15,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:16,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:16,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:18,574][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:17:19,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:17:19,536][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:17:21,028][__main__][INFO] - Iteration 5 took 55s (37.99% Gen, 62.00% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 17m 33s. Estimated total time: 15h 24m 23s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 26s, 500 more iterations: 7h 42m 11s.
[2025-08-20 08:17:21,030][__main__][INFO] - Starting iteration 5.
[2025-08-20 08:17:44,282][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:17:44,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:17:44,290][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:17:46,726][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:17:46,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:17:46,733][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:17:46,736][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:17:46,736][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:17:47,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:47,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:48,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:49,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:50,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:50,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:51,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:52,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:53,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:54,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:54,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:55,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:56,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:57,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:58,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:58,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:17:59,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:00,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:01,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:02,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:03,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:04,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:04,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:05,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:06,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:07,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:08,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:08,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:09,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:10,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:11,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:12,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:13,645][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:18:14,726][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:18:14,729][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:18:16,225][__main__][INFO] - Iteration 6 took 55s (37.68% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 15h 12m 9s. Estimated total time: 15h 19m 54s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 59s, 500 more iterations: 7h 39m 57s.
[2025-08-20 08:18:16,226][__main__][INFO] - Starting iteration 6.
[2025-08-20 08:18:40,356][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:18:40,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:18:40,364][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:18:42,809][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:18:42,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:18:42,817][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:18:42,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:18:42,820][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:18:43,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:43,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:44,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:45,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:46,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:47,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:47,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:48,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:49,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:50,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:51,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:51,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:52,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:53,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:54,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:55,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:55,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:56,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:57,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:58,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:18:58,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:00,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:01,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:01,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:02,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:03,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:04,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:05,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:05,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:06,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:07,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:08,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:09,798][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:19:10,714][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:19:10,716][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:19:12,110][__main__][INFO] - Iteration 7 took 55s (38.82% Gen, 61.18% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 22m 42s. Estimated total time: 15h 31m 23s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 41s.
[2025-08-20 08:19:12,112][__main__][INFO] - Starting iteration 7.
[2025-08-20 08:19:35,427][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:19:35,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:19:35,435][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:19:37,891][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:19:37,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:19:37,899][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:19:37,901][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:19:37,901][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:19:38,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:38,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:39,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:40,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:41,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:42,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:42,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:43,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:44,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:45,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:46,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:46,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:47,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:48,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:49,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:50,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:50,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:51,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:52,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:53,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:54,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:54,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:55,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:56,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:57,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:58,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:19:59,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:00,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:01,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:01,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:02,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:03,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:05,031][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:20:06,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:20:06,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:20:07,894][__main__][INFO] - Iteration 8 took 55s (37.41% Gen, 62.58% Train). Generation: 20s, Training: 34s. Estimated remaining time: 15h 20m 5s. Estimated total time: 15h 29m 41s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 50s.
[2025-08-20 08:20:07,896][__main__][INFO] - Starting iteration 8.
[2025-08-20 08:20:31,075][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:20:31,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:20:31,082][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:20:33,526][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:20:33,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:20:33,534][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:20:33,537][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:20:33,537][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:20:33,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:34,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:35,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:36,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:37,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:37,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:38,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:39,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:40,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:40,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:41,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:42,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:43,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:44,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:44,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:45,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:46,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:47,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:48,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:49,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:50,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:50,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:51,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:52,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:53,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:54,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:54,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:55,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:56,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:57,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:58,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:20:58,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:21:00,564][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:21:01,651][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:21:01,654][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:21:03,725][__main__][INFO] - Iteration 9 took 55s (37.14% Gen, 62.86% Train). Generation: 20s, Training: 35s. Estimated remaining time: 15h 19m 56s. Estimated total time: 15h 30m 29s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 14s.
[2025-08-20 08:21:03,823][__main__][INFO] - Starting iteration 9.
[2025-08-20 08:21:27,410][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:21:27,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:21:27,418][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:21:29,883][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:21:29,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:21:29,891][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:21:29,893][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:21:29,894][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:21:30,192 → 08:21:55,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32×)
[2025-08-20 08:21:56,878][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:21:57,826][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:21:57,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:21:59,176][__main__][INFO] - Iteration 10 took 55s (38.15% Gen, 61.84% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 11m 3s. Estimated total time: 15h 22m 31s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 15s.
[2025-08-20 08:21:59,708][__main__][INFO] - Starting iteration 10.
[2025-08-20 08:22:22,777][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:22:22,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:22:22,785][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:22:25,218][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:22:25,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:22:25,226][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:22:25,228][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:22:25,228][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:22:25,528 → 08:22:50,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32×)
[2025-08-20 08:22:52,255][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:22:53,204][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:22:53,206][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:22:54,684][__main__][INFO] - Iteration 11 took 54s (37.51% Gen, 62.49% Train). Generation: 20s, Training: 34s. Estimated remaining time: 15h 3m 51s. Estimated total time: 15h 16m 15s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 7s.
[2025-08-20 08:22:54,686][__main__][INFO] - Starting iteration 11.
[2025-08-20 08:23:18,313][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:23:18,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:23:18,320][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:23:20,756][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:23:20,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:23:20,763][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:23:20,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:23:20,766][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:23:21,065 → 08:23:46,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32×)
[2025-08-20 08:23:47,798][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:23:48,744][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:23:48,746][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:23:50,100][__main__][INFO] - Iteration 12 took 55s (38.25% Gen, 61.75% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 10m 14s. Estimated total time: 15h 23m 34s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 47s.
[2025-08-20 08:23:50,102][__main__][INFO] - Starting iteration 12.
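
The "For task: … , ΔVRAM … , ΔTime …" records read like the output of a timing and VRAM wrapper around each named task. A minimal sketch of such a wrapper using standard torch.cuda memory queries; the wrapper itself is hypothetical, the choice of reserved memory as the percentage basis is an assumption, and only the message format is taken from the log:

import time
from contextlib import contextmanager
import torch

@contextmanager
def profile_task(name, log):
    # snapshot time and GPU memory before the task runs
    t0 = time.monotonic()
    alloc0 = torch.cuda.memory_allocated()
    res0 = torch.cuda.memory_reserved()
    yield
    # report deltas and current VRAM usage after the task finishes
    d_alloc = (torch.cuda.memory_allocated() - alloc0) / 2**30  # GiB
    d_res = (torch.cuda.memory_reserved() - res0) / 2**30       # GiB
    dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
    total = torch.cuda.get_device_properties(0).total_memory
    pct = 100 * torch.cuda.memory_reserved() / total
    log(f"For task: {name} , ΔVRAM Allocated: {d_alloc} GB, ΔVRAM Reserved: {d_res} GB, "
        f"ΔTime: {dt}, Percentage of VRAM taken: {pct}%,")

# usage (illustrative): with profile_task("Create alternative trajectory batch", log): build_batch()

The 0.0 GB deltas on the advantage-alignment tasks are consistent with work that reuses already-allocated buffers, while the one-off 0.62 GB jump on the very first "Apply reinforce step" would correspond to optimizer state being allocated lazily on first use.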
[2025-08-20 08:24:13,778][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:24:13,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:24:13,785][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:24:16,246][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:24:16,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:24:16,254][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:24:16,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:24:16,257][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:24:16,556 → 08:24:42,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32×)
[2025-08-20 08:24:43,653][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:24:44,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:24:44,619][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:24:45,996][__main__][INFO] - Iteration 13 took 55s (37.97% Gen, 62.03% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 17m 19s. Estimated total time: 15h 31m 34s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 47s.
[2025-08-20 08:24:45,998][__main__][INFO] - Starting iteration 13.
[2025-08-20 08:25:09,157][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:25:09,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:25:09,164][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:25:11,630][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:25:11,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:25:11,638][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:25:11,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:25:11,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:25:11,941 → 08:25:37,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32×)
[2025-08-20 08:25:38,677][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:25:39,672][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:25:39,674][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:25:41,017][__main__][INFO] - Iteration 14 took 55s (37.63% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 15h 1m 48s. Estimated total time: 15h 16m 58s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 29s.
[2025-08-20 08:25:41,018][__main__][INFO] - Starting iteration 14.
[2025-08-20 08:26:04,198][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:26:04,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:26:04,206][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:26:06,669][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:26:06,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:26:06,676][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:26:06,679][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:26:06,679][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:26:06,980 → 08:26:32,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32×)
[2025-08-20 08:26:33,693][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:26:34,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:26:34,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:26:35,919][__main__][INFO] - Iteration 15 took 54s (37.75% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 58m 55s. Estimated total time: 15h 15m 0s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 30s.
[2025-08-20 08:26:35,921][__main__][INFO] - Starting iteration 15.
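
The time estimates are linear extrapolation from the average iteration duration: the iteration-15 summary implies 9m 9s / 10 = 54.9 s per iteration, so 100 iterations → 1h 31m 30s and 500 → 7h 37m 30s, and the 15h 15m 0s total is consistent with a budget of roughly 1000 iterations. A worked sketch of that arithmetic (the averaging window the trainer actually uses is an assumption):

def fmt(seconds: float) -> str:
    # render seconds as "Hh Mm Ss", dropping leading zero units like the log does
    s = round(seconds)
    h, m = s // 3600, s % 3600 // 60
    parts = ([f"{h}h"] if h else []) + ([f"{m}m"] if h or m else []) + [f"{s % 60}s"]
    return " ".join(parts)

mean_iteration_s = 549 / 10  # implied by "10 more iterations: 9m 9s"
for n in (10, 100, 500):
    print(f"{n} more iterations: {fmt(n * mean_iteration_s)}")
# -> 9m 9s, 1h 31m 30s, 7h 37m 30s, matching the iteration-15 summary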
[2025-08-20 08:26:59,445][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:26:59,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:26:59,453][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:01,931][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:01,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:01,939][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:01,941][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:27:01,941][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:27:02,239 → 08:27:27,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32×)
[2025-08-20 08:27:28,964][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:27:29,933][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:27:29,935][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:27:31,254][__main__][INFO] - Iteration 16 took 55s (38.03% Gen, 61.97% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 5m 12s. Estimated total time: 15h 22m 12s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 6s.
[2025-08-20 08:27:31,256][__main__][INFO] - Starting iteration 16.
[2025-08-20 08:27:54,647][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:54,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:54,655][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:57,125][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:57,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:57,134][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:27:57,136][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:27:57,137][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:27:57,436 → 08:28:22,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32×)
[2025-08-20 08:28:24,109][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:28:25,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:28:25,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:28:28,135][__main__][INFO] - Iteration 17 took 56s (36.80% Gen, 63.19% Train). Generation: 20s, Training: 35s. Estimated remaining time: 15h 30m 1s. Estimated total time: 15h 47m 58s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 47s, 500 more iterations: 7h 53m 59s.
[2025-08-20 08:28:28,136][__main__][INFO] - Starting iteration 17.
[2025-08-20 08:28:51,236][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:28:51,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:28:51,244][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:28:53,709][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:28:53,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:28:53,717][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:28:53,719][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:28:53,720][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:28:54,026 → 08:29:19,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32×)
[2025-08-20 08:29:20,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:29:21,703][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:29:21,704][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:29:23,122][__main__][INFO] - Iteration 18 took 54s (37.57% Gen, 62.43% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 57m 32s. Estimated total time: 15h 16m 24s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 12s.
[2025-08-20 08:29:23,123][__main__][INFO] - Starting iteration 18.
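
Both optimizer states are rewritten after every update, which is what keeps the run resumable from the last completed iteration. A sketch of the standard PyTorch pattern for this, with the destination directory taken from the log; the function and logger names are illustrative, not the actual mllm API:

import torch

CKPT_DIR = ("/network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/"
            "ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer")

def save_optimizer_states(policy_opt, critic_opt, log):
    for label, name, opt in (("main", "policy", policy_opt),
                             ("critic", "critic", critic_opt)):
        path = f"{CKPT_DIR}/{name}_optimizer_state.pt"
        # persist Adam-style moments etc.; restore with opt.load_state_dict(torch.load(path))
        torch.save(opt.state_dict(), path)
        log(f"Saved {label} optimizer state to {path}")

At roughly one second per save, the checkpointing overhead is small next to the 54-56 s iterations.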
[2025-08-20 08:29:46,273][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:29:46,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:29:46,280][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:29:48,747][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:29:48,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:29:48,755][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:29:48,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:29:48,758][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:29:49,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:49,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:50,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:51,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:52,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:53,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:53,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:54,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:55,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:56,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:56,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:57,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:58,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:29:59,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:00,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:00,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:01,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:02,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:03,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:04,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:05,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:05,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:07,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:08,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:09,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:09,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:10,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:11,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:12,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:13,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:13,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:14,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:16,347][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:30:17,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:30:17,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:30:18,666][__main__][INFO] - Iteration 19 took 55s (37.28% Gen, 62.71% Train). Generation: 20s, Training: 34s. Estimated remaining time: 15h 5m 54s. Estimated total time: 15h 25m 41s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 34s, 500 more iterations: 7h 42m 50s.
[2025-08-20 08:30:18,668][__main__][INFO] - Starting iteration 19.
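The timing summaries are consistent with a simple per-iteration average: at roughly 55.5 s per iteration, 10 more iterations come to about 9m 15s and 100 more to about 1h 32m 34s, matching the estimates logged above. A small sketch of that arithmetic (the exact averaging window mllm uses is an assumption):

    def format_eta(avg_iter_seconds: float, iterations_left: int) -> str:
        """E.g. format_eta(55.54, 100) -> '1h 32m 34s'."""
        total = int(avg_iter_seconds * iterations_left)
        h, rem = divmod(total, 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"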
[2025-08-20 08:30:42,159][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:30:42,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:30:42,166][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:30:44,606][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:30:44,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:30:44,614][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:30:44,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:30:44,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:30:44,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:45,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:46,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:47,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:48,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:48,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:49,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:50,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:51,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:52,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:52,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:53,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:54,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:55,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:56,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:56,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:57,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:58,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:30:59,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:00,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:01,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:02,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:02,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:03,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:04,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:05,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:06,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:06,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:07,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:08,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:09,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:10,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:11,639][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:31:12,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:31:12,581][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:31:13,955][__main__][INFO] - Iteration 20 took 55s (38.08% Gen, 61.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 0m 44s. Estimated total time: 15h 21m 26s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 8s, 500 more iterations: 7h 40m 43s.
[2025-08-20 08:31:13,957][__main__][INFO] - Starting iteration 20.
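The "For task: ... ΔVRAM ... ΔTime" records read like the output of a context manager that snapshots CUDA memory and wall-clock time around each task. A plausible reconstruction, not the actual mllm implementation ("Percentage of VRAM taken" is assumed here to be reserved memory over device capacity):

    import time
    import torch
    from contextlib import contextmanager

    GIB = 1024 ** 3

    @contextmanager
    def track_task(name, logger, device=0):
        alloc0 = torch.cuda.memory_allocated(device)
        resv0 = torch.cuda.memory_reserved(device)
        t0 = time.monotonic()
        yield
        d_alloc = (torch.cuda.memory_allocated(device) - alloc0) / GIB
        d_resv = (torch.cuda.memory_reserved(device) - resv0) / GIB
        dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
        total = torch.cuda.get_device_properties(device).total_memory
        pct = 100 * torch.cuda.memory_reserved(device) / total
        logger.info(f"For task: {name} , ΔVRAM Allocated: {d_alloc} GB, "
                    f"ΔVRAM Reserved: {d_resv} GB, ΔTime: {dt}, "
                    f"Percentage of VRAM taken: {pct}%,")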
[2025-08-20 08:31:40,605][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:31:40,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:31:40,613][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:31:43,087][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:31:43,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:31:43,095][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:31:43,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:31:43,097][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:31:43,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:44,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:44,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:45,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:46,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:47,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:48,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:48,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:49,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:50,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:51,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:52,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:52,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:53,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:54,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:55,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:56,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:56,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:57,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:58,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:31:59,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:00,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:01,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:02,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:02,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:03,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:04,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:05,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:06,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:06,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:07,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:08,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:10,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:32:10,996][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:32:10,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:32:12,365][__main__][INFO] - Iteration 21 took 58s (41.38% Gen, 58.62% Train). Generation: 24s, Training: 34s. Estimated remaining time: 15h 51m 46s. Estimated total time: 16h 13m 27s. Time estimates for 10 more iterations: 9m 44s, 100 more iterations: 1h 37m 20s, 500 more iterations: 8h 6m 43s.
[2025-08-20 08:32:12,366][__main__][INFO] - Starting iteration 21.
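The "Sharing advantage alignment data" / "Receiving advantage packets" pair suggests the two negotiating trainers swap per-step advantage estimates so each can fold the opponent's advantages into its own advantage-alignment loss. One way such an exchange could look with torch.distributed; the transport mllm actually uses is an assumption:

    import torch
    import torch.distributed as dist

    def exchange_advantage_packet(own_adv: torch.Tensor, peer_rank: int) -> torch.Tensor:
        """Send our advantage estimates and receive the peer's in return."""
        opp_adv = torch.empty_like(own_adv)
        req = dist.isend(own_adv, dst=peer_rank)  # "Sharing advantage alignment data."
        dist.recv(opp_adv, src=peer_rank)         # "Receiving advantage packets."
        req.wait()
        return opp_adv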
[2025-08-20 08:32:35,524][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:32:35,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:32:35,531][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:32:37,961][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:32:37,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:32:37,969][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:32:37,971][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:32:37,971][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:32:38,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:39,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:39,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:40,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:41,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:42,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:43,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:43,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:44,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:45,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:46,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:47,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:47,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:48,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:49,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:50,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:51,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:52,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:52,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:53,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:54,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:55,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:56,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:57,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:58,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:59,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:32:59,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:00,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:01,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:02,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:03,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:03,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:05,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:33:06,373][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:33:06,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:33:07,701][__main__][INFO] - Iteration 22 took 55s (37.47% Gen, 62.53% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 59m 37s. Estimated total time: 15h 22m 13s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 6s.
[2025-08-20 08:33:07,702][__main__][INFO] - Starting iteration 22.
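Before each exchange, the trainer builds an alternative trajectory batch and computes advantage estimates for it with the critic. The log does not show which estimator is used; a standard one-step TD sketch would be:

    import torch

    def td_advantages(values: torch.Tensor, rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
        """A_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V = 0 past the final step.
        values, rewards: (batch, T)."""
        next_values = torch.cat([values[:, 1:], torch.zeros_like(values[:, :1])], dim=1)
        return rewards + gamma * next_values - values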
[2025-08-20 08:33:30,823][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:33:30,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:33:30,831][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:33:33,284][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:33:33,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:33:33,292][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:33:33,294][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:33:33,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:33:33,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:34,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:35,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:35,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:36,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:37,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:38,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:39,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:39,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:40,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:41,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:42,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:43,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:43,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:44,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:45,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:46,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:47,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:47,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:48,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:49,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:50,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:51,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:52,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:53,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:53,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:54,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:55,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:56,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:57,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:57,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:33:58,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:00,328][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:34:01,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:34:01,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:34:02,824][__main__][INFO] - Iteration 23 took 55s (37.51% Gen, 62.49% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 55m 9s. Estimated total time: 15h 18m 41s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 20s.
[2025-08-20 08:34:02,825][__main__][INFO] - Starting iteration 23.
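After every update, both optimizer states are written back to the run directory, presumably so a preempted job can resume mid-training. The two save records above amount to something like the following (`out_dir` stands in for the seed_1000/agent_trainer path):

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
        """Persist both optimizer states, matching the 'Saved main/critic
        optimizer state' records."""
        torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
        torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")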
[2025-08-20 08:34:26,000][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:34:26,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:34:26,008][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:34:28,449][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:34:28,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:34:28,457][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:34:28,459][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:34:28,460][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:34:28,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:29,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:30,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:31,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:31,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:32,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:33,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:34,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:35,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:35,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:36,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:37,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:38,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:39,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:39,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:40,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:41,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:42,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:43,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:43,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:44,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:45,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:46,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:47,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:48,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:49,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:49,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:50,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:51,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:52,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:53,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:53,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:34:55,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:34:56,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:34:56,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:34:57,723][__main__][INFO] - Iteration 24 took 54s (37.78% Gen, 62.22% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 50m 30s. Estimated total time: 15h 14m 57s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 28s.
[2025-08-20 08:34:57,725][__main__][INFO] - Starting iteration 24.
[2025-08-20 08:35:21,677][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:35:21,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:35:21,685][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:35:24,159][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:35:24,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:35:24,167][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:35:24,169][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:35:24,169][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:35:24,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:25,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:26,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:26,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:27,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:28,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:29,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:30,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:30,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:31,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:32,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:33,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:33,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:34,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:35,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:36,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:37,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:38,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:39,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:40,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:40,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:41,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:42,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:43,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:43,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:44,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:45,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:46,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:47,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:47,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:48,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:49,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:35:51,120][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:35:52,029][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:35:52,031][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:35:53,280][__main__][INFO] - Iteration 25 took 55s (38.69% Gen, 61.31% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 0m 32s. Estimated total time: 15h 25m 55s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 35s, 500 more iterations: 7h 42m 57s.
[2025-08-20 08:35:53,282][__main__][INFO] - Starting iteration 25.
[2025-08-20 08:36:16,833][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:36:16,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:36:16,841][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:36:19,302][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:36:19,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:36:19,310][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:36:19,312][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:36:19,313][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:36:19,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:20,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:21,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:21,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:22,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:23,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:24,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:25,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:25,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:26,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:27,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:28,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:29,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:29,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:30,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:31,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:32,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:33,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:33,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:34,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:35,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:36,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:37,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:38,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:39,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:39,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:40,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:41,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:42,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:43,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:43,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:44,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:36:46,317][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:36:47,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:36:47,428][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:36:48,720][__main__][INFO] - Iteration 26 took 55s (38.08% Gen, 61.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 57m 39s. Estimated total time: 15h 23m 57s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 58s.
[2025-08-20 08:36:48,721][__main__][INFO] - Starting iteration 26.
[2025-08-20 08:37:11,812][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:37:11,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:37:11,820][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:37:14,267][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:37:14,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:37:14,275][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:37:14,278][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:37:14,278][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:37:14,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:15,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:16,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:16,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:17,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:18,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:19,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:20,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:20,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:21,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:22,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:23,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:24,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:24,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:25,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:26,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:27,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:28,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:28,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:29,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:30,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:31,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:32,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:33,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:34,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:34,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:35,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:36,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:37,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:38,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:38,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:39,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:37:41,214][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:37:42,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:37:42,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:37:43,549][__main__][INFO] - Iteration 27 took 54s (37.65% Gen, 62.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 46m 35s. Estimated total time: 15h 13m 47s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 22s, 500 more iterations: 7h 36m 53s.
[2025-08-20 08:37:43,551][__main__][INFO] - Starting iteration 27.
[2025-08-20 08:38:07,630][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:38:07,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:38:07,638][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:38:10,086][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:38:10,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:38:10,094][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:38:10,096][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:38:10,097][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:38:10,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:11,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:11,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:12,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:13,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:14,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:15,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:15,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:16,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:17,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:18,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:19,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:19,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:20,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:21,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:22,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:23,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:23,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:24,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:25,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:26,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:27,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:27,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:28,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:29,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:30,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:31,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:31,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:33,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:33,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:34,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:35,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:38:37,083][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:38:38,023][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:38:38,024][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:38:39,312][__main__][INFO] - Iteration 28 took 55s (38.77% Gen, 61.23% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 1m 12s. Estimated total time: 15h 29m 20s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 56s, 500 more iterations: 7h 44m 40s.
[2025-08-20 08:38:39,313][__main__][INFO] - Starting iteration 28.
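Each iteration logs 32 accumulation steps of 8,960 tokens each, followed by one "Apply reinforce step": the classic gradient-accumulation pattern of summing scaled micro-batch gradients and calling the optimizer once. A sketch under those assumptions; the batch keys, the HF-style model output, and the per-token REINFORCE loss are illustrative, since the log shows only the step counts.

import torch

def apply_reinforce_step(model, optimizer, micro_batches, logger):
    # Accumulate gradients over all micro-batches, then step once.
    optimizer.zero_grad(set_to_none=True)
    n = len(micro_batches)  # 32 per iteration in this run
    for batch in micro_batches:
        logits = model(batch["input_ids"]).logits        # HF-style output (assumed)
        logp = torch.log_softmax(logits, dim=-1)
        # Log-prob of each realized token (label shifting omitted for brevity).
        token_logp = logp.gather(-1, batch["labels"].unsqueeze(-1)).squeeze(-1)
        mask = batch["loss_mask"]                        # 1 on trainable tokens
        # REINFORCE surrogate: -(advantage * log-prob), averaged over tokens.
        loss = -(batch["advantages"] * token_logp * mask).sum() / mask.sum()
        # Scale by 1/n so accumulated gradients match one full-batch step.
        (loss / n).backward()
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    int(mask.sum().item()))
    optimizer.step()  # the "Apply reinforce step" in the log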
[2025-08-20 08:39:02,669][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:39:02,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:39:02,677][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:39:05,149][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:39:05,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:39:05,157][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:39:05,159][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:39:05,160][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:39:05,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:06,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:07,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:07,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:08,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:09,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:10,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:11,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:11,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:12,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:13,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:14,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:14,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:15,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:16,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:17,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:18,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:18,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:19,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:20,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:21,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:22,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:22,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:23,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:25,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:25,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:26,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:27,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:28,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:29,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:29,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:30,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:39:32,159][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:39:33,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:39:33,092][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:39:34,458][__main__][INFO] - Iteration 29 took 55s (37.87% Gen, 62.12% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 50m 0s. Estimated total time: 15h 19m 3s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 31s.
[2025-08-20 08:39:34,459][__main__][INFO] - Starting iteration 29.
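The iteration summaries are plain extrapolation from a running mean of iteration wall time: with a mean near 55 s, 10 more iterations is about 9 m 10 s and 500 more about 7 h 38 m, matching the figures above. A sketch of that bookkeeping; the simple running-mean smoothing and the function itself are assumptions.

def iteration_summary(iteration: int, total_iterations: int,
                      gen_s: float, train_s: float, mean_iter_s: float):
    # `mean_iter_s` is the running mean over previous iterations; iteration >= 1.
    total = gen_s + train_s
    mean_iter_s += (total - mean_iter_s) / iteration

    def fmt(seconds: float) -> str:
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

    line = (
        f"Iteration {iteration} took {round(total)}s "
        f"({100 * gen_s / total:.2f}% Gen, {100 * train_s / total:.2f}% Train). "
        f"Generation: {round(gen_s)}s, Training: {round(train_s)}s. "
        f"Estimated remaining time: {fmt(mean_iter_s * (total_iterations - iteration))}. "
        f"Estimated total time: {fmt(mean_iter_s * total_iterations)}. "
        f"Time estimates for 10 more iterations: {fmt(10 * mean_iter_s)}, "
        f"100 more iterations: {fmt(100 * mean_iter_s)}, "
        f"500 more iterations: {fmt(500 * mean_iter_s)}."
    )
    return line, mean_iter_s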
[2025-08-20 08:39:57,695][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:39:57,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:39:57,703][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:00,160][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:00,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:00,168][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:00,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:40:00,171][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:40:00,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:01,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:02,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:02,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:03,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:04,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:05,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:06,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:06,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:07,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:08,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:09,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:10,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:10,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:11,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:12,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:13,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:13,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:15,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:16,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:16,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:17,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:18,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:19,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:19,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:20,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:21,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:22,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:23,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:23,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:24,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:25,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:27,120][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:40:28,207][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:40:28,209][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:40:29,531][__main__][INFO] - Iteration 30 took 55s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 47m 53s. Estimated total time: 15h 17m 51s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 55s.
[2025-08-20 08:40:29,533][__main__][INFO] - Starting iteration 30.
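"Sharing advantage alignment data" and "Receiving advantage packets" indicate that the two trainers exchange per-step advantage estimates so each can shape its policy gradient with the opponent's signal. The exact combination is not visible in the log; the sketch below uses an advantage-alignment-style term (own past advantages gating the opponent's current advantage), with beta and the tensor shapes as assumptions.

import torch

def aligned_pg_weights(a_self: torch.Tensor, a_opp: torch.Tensor,
                       beta: float = 1.0) -> torch.Tensor:
    # Illustrative only: combine the agent's own advantage A^1_t with an
    # alignment term that multiplies the sum of its strictly-past advantages
    # by the opponent's current advantage A^2_t. Shapes: (T,) per trajectory.
    past_self = torch.cumsum(a_self, dim=-1) - a_self  # sum of A^1_k for k < t
    return a_self + beta * past_self * a_opp

Weights of this form would stand in for the per-token advantages consumed by the gradient-accumulation loop sketched earlier.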
[2025-08-20 08:40:53,072][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:53,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:53,080][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:55,524][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:55,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:55,532][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:40:55,534][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:40:55,535][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:40:55,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:56,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:57,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:58,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:59,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:40:59,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:00,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:01,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:02,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:02,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:03,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:04,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:05,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:06,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:06,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:07,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:08,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:09,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:10,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:10,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:11,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:12,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:13,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:14,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:14,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:16,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:16,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:17,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:18,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:19,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:20,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:20,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:22,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:41:23,559][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:41:23,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:41:24,844][__main__][INFO] - Iteration 31 took 55s (38.12% Gen, 61.88% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 50m 56s. Estimated total time: 15h 21m 50s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 11s, 500 more iterations: 7h 40m 55s.
[2025-08-20 08:41:24,846][__main__][INFO] - Starting iteration 31.
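After every reinforce step both optimizer states are checkpointed to the run directory, which lets a preempted job resume with optimizer moments intact. A sketch assuming standard torch.save of each optimizer's state_dict; only the two .pt paths are from the log, the helper itself is illustrative.

import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir, logger):
    # Persist both optimizers so a killed run can resume mid-training.
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)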
[2025-08-20 08:41:47,997][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:41:47,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:41:48,005][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:41:50,458][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:41:50,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:41:50,466][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:41:50,469][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:41:50,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:41:50,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:51,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:52,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:53,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:53,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:54,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:55,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:56,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:57,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:57,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:58,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:41:59,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:00,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:01,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:01,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:02,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:03,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:04,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:05,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:05,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:06,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:07,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:08,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:09,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:10,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:11,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:11,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:12,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:13,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:14,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:15,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:15,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:17,512][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:42:18,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:42:18,465][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:42:19,725][__main__][INFO] - Iteration 32 took 54s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 42m 49s. Estimated total time: 15h 14m 38s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 19s.
[2025-08-20 08:42:19,726][__main__][INFO] - Starting iteration 32.
[2025-08-20 08:42:42,888][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:42:42,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:42:42,896][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:42:45,354][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:42:45,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:42:45,362][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:42:45,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:42:45,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:42:45,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:46,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:47,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:48,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:48,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:49,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:50,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:51,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:52,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:52,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:53,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:54,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:55,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:55,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:56,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:57,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:58,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:59,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:42:59,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:00,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:01,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:02,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:03,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:04,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:05,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:05,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:06,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:07,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:08,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:09,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:09,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:10,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:12,270][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:43:13,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:43:13,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:43:15,676][__main__][INFO] - Iteration 33 took 55s (37.02% Gen, 62.98% Train). Generation: 20s, Training: 35s. Estimated remaining time: 14h 59m 44s. Estimated total time: 15h 32m 29s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 14s, 500 more iterations: 7h 46m 14s.
[2025-08-20 08:43:15,678][__main__][INFO] - Starting iteration 33.
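The recurring "Create alternative trajectory batch" / "Compute alternative advantage estimates" pair suggests the critic is re-evaluated on counterfactual continuations to estimate advantages for actions other than the ones sampled. Neither the batch construction nor the estimator appears in the log, so the one-step temporal-difference form below, and the 0.99 discount, are assumptions.

import torch

def one_step_advantages(rewards: torch.Tensor, values: torch.Tensor,
                        next_values: torch.Tensor,
                        gamma: float = 0.99) -> torch.Tensor:
    # A_t = r_t + gamma * V(s_{t+1}) - V(s_t); works for the taken action or,
    # with next_values from a counterfactual batch, for an alternative action.
    return rewards + gamma * next_values - values

# Hypothetical usage, with `alt_next_values` obtained by running the critic on
# the alternative-trajectory batch (same prefixes, different actions):
#   adv_taken = one_step_advantages(rewards, values, next_values)
#   adv_alt   = one_step_advantages(rewards, values, alt_next_values)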
[2025-08-20 08:43:38,899][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:43:38,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:43:38,907][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:43:41,368][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:43:41,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:43:41,375][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:43:41,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:43:41,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:43:41,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:42,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:43,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:44,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:44,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:45,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:46,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:47,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:48,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:48,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:49,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:50,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:51,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:52,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:52,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:53,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:54,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:55,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:55,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:56,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:57,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:58,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:59,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:43:59,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:00,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:01,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:02,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:03,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:04,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:05,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:05,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:06,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:08,337][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:44:09,285][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:44:09,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:44:10,578][__main__][INFO] - Iteration 34 took 54s (37.81% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 41m 20s. Estimated total time: 15h 14m 59s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 29s.
[2025-08-20 08:44:10,580][__main__][INFO] - Starting iteration 34.
[2025-08-20 08:44:33,691][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:44:33,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:44:33,699][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:44:36,149][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:44:36,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:44:36,157][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:44:36,159][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:44:36,160][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:44:36,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:37,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:38,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:38,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:39,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:40,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:41,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:42,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:42,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:43,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:44,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:45,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:45,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:46,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:47,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:48,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:49,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:49,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:50,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:51,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:52,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:53,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:54,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:55,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:55,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:56,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:57,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:58,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:59,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:44:59,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:00,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:01,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:03,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:45:04,026][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:45:04,028][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:45:05,500][__main__][INFO] - Iteration 35 took 54s (37.63% Gen, 62.37% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 40m 45s. Estimated total time: 15h 15m 19s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 39s.
[2025-08-20 08:45:05,501][__main__][INFO] - Starting iteration 35.
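"Get advantages with critic gradient accumulation" reads as a fused pass: score states with the critic micro-batch by micro-batch, backpropagate the value loss immediately so gradients accumulate, and hand detached advantages to the policy update. A sketch under that reading; the MSE-to-return loss and the Monte-Carlo advantage are assumptions.

import torch
import torch.nn.functional as F

def advantages_with_critic_accumulation(critic, states, returns, n_micro=8):
    # Evaluate V(s) chunk by chunk; backprop the value loss as we go
    # (gradient accumulation) and collect detached advantages for the policy.
    advantages = []
    for s, r in zip(states.chunk(n_micro), returns.chunk(n_micro)):
        v = critic(s).squeeze(-1)                  # critic assumed to output (B, 1)
        (F.mse_loss(v, r) / n_micro).backward()    # accumulate critic gradients
        advantages.append((r - v).detach())        # Monte-Carlo advantage R - V(s)
    return torch.cat(advantages)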
[2025-08-20 08:45:29,052][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:45:29,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:45:29,060][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:45:31,528][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:45:31,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:45:31,535][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:45:31,537][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:45:31,539][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:45:31,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:32,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:33,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:34,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:35,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:35,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:36,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:37,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:38,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:38,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:39,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:40,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:41,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:42,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:42,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:43,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:44,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:45,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:46,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:46,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:47,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:48,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:49,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:50,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:51,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:52,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:52,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:53,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:54,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:55,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:56,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:56,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:45:58,452][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:45:59,437][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:45:59,438][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:46:00,821][__main__][INFO] - Iteration 36 took 55s (38.14% Gen, 61.86% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 46m 28s. Estimated total time: 15h 21m 58s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 11s, 500 more iterations: 7h 40m 59s.
[2025-08-20 08:46:00,823][__main__][INFO] - Starting iteration 36.
[2025-08-20 08:46:24,801][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:46:24,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:46:24,808][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:46:27,277][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:46:27,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:46:27,285][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:46:27,288][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:46:27,288][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:46:27,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:28,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:29,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:29,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:30,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:31,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:32,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:33,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:33,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:34,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:35,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:36,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:37,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:37,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:38,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:39,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:40,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:41,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:41,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:42,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:43,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:44,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:45,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:46,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:47,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:47,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:48,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:49,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:50,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:51,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:51,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:52,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:46:54,260][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:46:55,375][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:46:55,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:46:58,459][__main__][INFO] - Iteration 37 took 57s (37.32% Gen, 62.68% Train). Generation: 21s, Training: 36s. Estimated remaining time: 15h 24m 8s. Estimated total time: 16h 0m 35s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 3s, 500 more iterations: 8h 0m 17s.
[2025-08-20 08:46:58,460][__main__][INFO] - Starting iteration 37.
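
Editor's note: each iteration emits 32 "Accumulated the policy gradient loss for 8960 tokens" records (≈287k tokens) followed by a single "Apply reinforce step", which matches the standard gradient-accumulation pattern: many micro-batch backward passes, then one optimizer step. A minimal sketch under that assumption; all tensor and function names are hypothetical, not the repo's real batch layout.

import torch

def accumulate_reinforce_step(model, optimizer, micro_batches, logger):
    """Accumulate token-level policy-gradient losses, then step once."""
    optimizer.zero_grad(set_to_none=True)
    n_micro = len(micro_batches)  # 32 in the run above
    for input_ids, action_ids, advantages, action_mask in micro_batches:
        logits = model(input_ids).logits
        logprobs = torch.log_softmax(logits, dim=-1)
        taken = logprobs.gather(-1, action_ids.unsqueeze(-1)).squeeze(-1)
        # REINFORCE surrogate: -A(s, a) * log pi(a|s), masked to action tokens
        loss = -(advantages * taken * action_mask).sum()
        loss = loss / action_mask.sum() / n_micro
        loss.backward()  # gradients accumulate across micro-batches
        logger.info(
            "Accumulated the policy gradient loss for %d tokens."
            % int(action_mask.sum())
        )
    optimizer.step()  # the "Apply reinforce step" record
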
[2025-08-20 08:47:21,685][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:47:21,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:47:21,693][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:47:24,152][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:47:24,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:47:24,160][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:47:24,163][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:47:24,163][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:47:24,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:25,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:26,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:26,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:27,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:28,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:29,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:30,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:30,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:31,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:32,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:33,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:33,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:34,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:35,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:36,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:37,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:37,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:38,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:39,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:40,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:41,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:42,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:43,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:43,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:44,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:45,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:46,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:47,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:47,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:48,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:49,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:47:51,074][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:47:52,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:47:52,020][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:47:54,809][__main__][INFO] - Iteration 38 took 56s (36.87% Gen, 63.13% Train). Generation: 20s, Training: 35s. Estimated remaining time: 15h 1m 44s. Estimated total time: 15h 39m 8s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 34s.
[2025-08-20 08:47:54,811][__main__][INFO] - Starting iteration 38.
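
Editor's note: the two "Saved … optimizer state" records at the end of every iteration point to plain state_dict checkpointing, so a preempted run can resume with Adam moments intact. A minimal sketch, assuming torch.save/torch.load semantics; the function names are hypothetical, while the file layout mirrors the paths in the log.

import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir):
    # Persist both optimizers' moments each iteration so training can
    # resume after preemption without losing optimizer state.
    torch.save(policy_optimizer.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_optimizer.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

def load_optimizer_states(policy_optimizer, critic_optimizer, out_dir):
    policy_optimizer.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
    critic_optimizer.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))
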
[2025-08-20 08:48:18,259][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:48:18,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:48:18,267][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:48:20,723][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:48:20,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:48:20,731][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:48:20,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:48:20,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:48:21,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:21,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:22,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:23,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:24,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:24,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:25,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:26,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:27,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:28,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:28,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:29,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:30,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:31,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:32,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:32,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:33,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:34,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:35,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:36,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:37,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:38,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:38,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:39,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:40,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:41,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:42,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:42,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:43,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:44,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:45,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:46,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:48:47,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:48:48,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:48:48,641][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:48:50,163][__main__][INFO] - Iteration 39 took 55s (37.93% Gen, 62.07% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 44m 12s. Estimated total time: 15h 22m 31s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 15s.
[2025-08-20 08:48:50,164][__main__][INFO] - Starting iteration 39.
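
Editor's note: the time estimates look like a linear extrapolation from the running mean iteration time; for example, iteration 39's "10 more iterations: 9m 13s" and "500 more iterations: 7h 41m 15s" both imply an average of about 55.3 s/iteration. A sketch of that arithmetic, with a hypothetical function name:

import datetime

def extrapolate_eta(iteration_seconds, iterations_left):
    """Estimate remaining time from the mean iteration duration (illustrative)."""
    avg = sum(iteration_seconds) / len(iteration_seconds)
    remaining = datetime.timedelta(seconds=int(avg * iterations_left))
    horizons = {n: datetime.timedelta(seconds=int(avg * n)) for n in (10, 100, 500)}
    return avg, remaining, horizons

# e.g. avg ≈ 55.3 s  ->  10 iterations ≈ 9m 13s, 500 iterations ≈ 7h 41m,
# matching the estimates printed after iteration 39.
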
[2025-08-20 08:49:13,362][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:49:13,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:49:13,369][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:49:15,836][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:49:15,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:49:15,845][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:49:15,848][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:49:15,848][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:49:16,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:16,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:17,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:18,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:19,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:20,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:20,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:21,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:22,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:23,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:24,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:24,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:25,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:26,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:27,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:28,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:28,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:29,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:30,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:31,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:32,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:33,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:34,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:34,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:35,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:36,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:37,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:38,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:38,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:39,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:40,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:41,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:49:42,821][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:49:43,811][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:49:43,813][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:49:45,250][__main__][INFO] - Iteration 40 took 55s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 38m 51s. Estimated total time: 15h 18m 5s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 2s.
[2025-08-20 08:49:45,252][__main__][INFO] - Starting iteration 40.
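
Editor's note: "Sharing advantage alignment data" / "Receiving advantage packets" suggests the two agent trainers exchange per-trajectory advantages so each can shape its policy gradient with its opponent's advantages (the "ad_align" of the run name). A hedged sketch of what that exchange could look like; the queue transport and especially the alignment term are assumptions, not the repo's verified implementation.

import torch
import torch.multiprocessing as mp

def exchange_advantage_packets(own_adv: torch.Tensor,
                               send_q: "mp.Queue",
                               recv_q: "mp.Queue") -> torch.Tensor:
    """Share this trainer's advantages and block on the opponent's packet."""
    send_q.put(own_adv.detach().cpu())  # "Sharing advantage alignment data."
    return recv_q.get()                 # "Receiving advantage packets."

def alignment_term(own_adv: torch.Tensor, opp_adv: torch.Tensor) -> torch.Tensor:
    # One plausible opponent-shaping term: weight each step's advantage by
    # the opponent's accumulated earlier advantages. The exact formula used
    # by trainer_ad_align is not recoverable from the log.
    past_opp = torch.cumsum(opp_adv, dim=-1) - opp_adv
    return own_adv * past_opp
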
[2025-08-20 08:50:08,844][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:50:08,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:50:08,851][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:50:11,318][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:50:11,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:50:11,326][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:50:11,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:50:11,329][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:50:11,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:12,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:13,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:14,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:14,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:15,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:16,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:17,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:17,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:18,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:19,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:20,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:21,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:21,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:22,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:23,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:24,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:25,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:26,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:27,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:27,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:28,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:29,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:30,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:31,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:31,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:33,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:33,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:34,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:35,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:36,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:37,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:50:38,558][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:50:39,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:50:39,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:50:40,978][__main__][INFO] - Iteration 41 took 55s (37.94% Gen, 62.06% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 48m 35s. Estimated total time: 15h 28m 45s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 22s.
[2025-08-20 08:50:40,979][__main__][INFO] - Starting iteration 41.
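
Editor's note: the task trio (create alternative trajectory batch → get advantages with critic gradient accumulation → compute alternative advantage estimates) runs twice per iteration, consistent with the critic being fit and queried on both the sampled and a counterfactual ("alternative") batch. A hedged sketch of advantage estimation with an accumulated critic regression loss; every name here is hypothetical.

import torch
import torch.nn.functional as F

def advantages_with_critic_accumulation(critic, states, returns, n_micro):
    """Estimate A = R - V(s) while accumulating the critic's value loss.

    `critic` is assumed to map state features to scalar values; its optimizer
    is stepped elsewhere, so the backward here only accumulates gradients.
    """
    values = critic(states).squeeze(-1)
    (F.mse_loss(values, returns) / n_micro).backward()  # critic grad accumulation
    return (returns - values).detach()  # advantage estimates

# Running the same routine on the "alternative" batch would yield the
# alternative advantage estimates that advantage alignment consumes.
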
[2025-08-20 08:51:04,320][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:51:04,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:51:04,327][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:51:06,805][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:51:06,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:51:06,813][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:51:06,816][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:51:06,816][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:51:07,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:07,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:08,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:09,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:10,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:11,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:11,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:12,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:13,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:14,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:15,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:15,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:16,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:17,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:18,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:19,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:19,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:21,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:21,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:22,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:23,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:24,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:25,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:25,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:26,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:27,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:28,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:29,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:29,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:30,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:31,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:32,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:51:33,781][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:51:34,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:51:34,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:51:36,096][__main__][INFO] - Iteration 42 took 55s (37.81% Gen, 62.18% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 37m 31s. Estimated total time: 15h 18m 36s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 51s, 500 more iterations: 7h 39m 18s.
[2025-08-20 08:51:36,098][__main__][INFO] - Starting iteration 42.
[2025-08-20 08:52:00,432][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:00,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:00,440][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:02,919][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:02,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:02,927][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:02,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:52:02,929][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:52:03,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:04,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:04,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:05,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:06,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:07,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:07,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:08,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:09,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:10,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:11,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:11,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:12,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:13,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:14,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:15,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:15,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:16,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:17,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:18,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:19,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:19,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:20,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:21,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:22,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:23,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:24,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:25,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:26,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:26,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:27,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:28,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:29,979][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:52:30,979][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:52:30,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:52:32,290][__main__][INFO] - Iteration 43 took 56s (38.87% Gen, 61.13% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 54m 30s. Estimated total time: 15h 36m 31s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 15s.
[2025-08-20 08:52:32,292][__main__][INFO] - Starting iteration 43.
[2025-08-20 08:52:55,870][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:55,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:55,877][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:58,347][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:58,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:58,355][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:52:58,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:52:58,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:52:58,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:52:59,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:00,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:01,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:01,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:02,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:03,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:04,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:05,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:05,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:06,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:07,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:08,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:08,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:09,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:10,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:11,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:12,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:12,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:13,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:14,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:15,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:16,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:17,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:18,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:18,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:19,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:20,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:21,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:22,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:22,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:23,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:25,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:53:26,518][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:53:26,521][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:53:27,886][__main__][INFO] - Iteration 44 took 55s (37.96% Gen, 62.04% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 43m 36s. Estimated total time: 15h 26m 33s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 16s.
[2025-08-20 08:53:27,888][__main__][INFO] - Starting iteration 44.
[2025-08-20 08:53:51,253][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:53:51,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:53:51,261][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:53:53,704][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:53:53,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:53:53,712][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:53:53,714][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:53:53,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:53:54,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:54,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:55,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:56,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:57,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:57,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:58,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:53:59,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:00,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:01,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:01,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:02,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:03,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:04,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:05,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:05,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:06,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:07,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:08,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:09,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:09,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:10,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:11,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:12,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:13,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:14,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:15,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:15,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:16,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:17,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:18,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:19,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:20,647][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:54:21,607][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:54:21,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:54:23,082][__main__][INFO] - Iteration 45 took 55s (37.88% Gen, 62.12% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 36m 2s. Estimated total time: 15h 19m 54s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 59s, 500 more iterations: 7h 39m 57s.
[2025-08-20 08:54:23,084][__main__][INFO] - Starting iteration 45.
[2025-08-20 08:54:46,763][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:54:46,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:54:46,770][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:54:49,237][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:54:49,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:54:49,246][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:54:49,249][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:54:49,249][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:54:49,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:50,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:51,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:51,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:52,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:53,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:54,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:55,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:55,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:56,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:57,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:58,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:59,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:54:59,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:00,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:01,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:02,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:03,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:04,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:05,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:05,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:06,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:07,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:08,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:09,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:10,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:11,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:11,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:12,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:13,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:14,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:15,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:16,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:55:17,876][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:55:17,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:55:19,243][__main__][INFO] - Iteration 46 took 56s (37.77% Gen, 62.23% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 51m 10s. Estimated total time: 15h 35m 59s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 59s.
[2025-08-20 08:55:19,246][__main__][INFO] - Starting iteration 46.
[2025-08-20 08:55:42,803][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:55:42,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:55:42,810][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:55:45,261][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:55:45,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:55:45,269][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:55:45,272][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:55:45,272][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:55:45,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:46,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:47,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:47,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:48,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:49,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:50,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:51,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:51,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:52,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:53,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:54,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:55,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:55,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:56,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:57,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:58,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:59,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:55:59,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:00,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:01,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:02,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:03,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:04,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:05,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:05,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:06,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:07,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:08,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:09,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:09,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:10,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:12,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:56:13,247][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:56:13,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:56:14,601][__main__][INFO] - Iteration 47 took 55s (38.13% Gen, 61.87% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 36m 51s. Estimated total time: 15h 22m 35s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 17s.
[2025-08-20 08:56:14,603][__main__][INFO] - Starting iteration 47.
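(Editor's note: each "Accumulated the policy gradient loss for 8960 tokens." record marks one micro-batch whose REINFORCE loss is backpropagated before the single "Apply reinforce step" update; 32 micro-batches of 8960 tokens is roughly 287k trainable tokens per optimizer step. A minimal sketch of that accumulate-then-step pattern in plain PyTorch follows; policy.log_prob, the batch keys, and the clipping threshold are assumptions, not the mllm API.)

    import torch

    def reinforce_step(policy, optimizer, micro_batches, logger):
        """Sketch: accumulate per-token policy-gradient loss, then one update."""
        optimizer.zero_grad(set_to_none=True)
        for batch in micro_batches:
            logp = policy.log_prob(batch["input_ids"])   # hypothetical API
            mask = batch["action_mask"]                  # 1 where tokens are trained
            n_tokens = int(mask.sum())
            # REINFORCE surrogate: advantage-weighted negative log-likelihood.
            loss = -(logp * batch["advantages"] * mask).sum() / n_tokens
            (loss / len(micro_batches)).backward()       # scale for accumulation
            logger.info("Accumulated the policy gradient loss for %d tokens.", n_tokens)
        torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)  # assumed clipping
        optimizer.step()                                 # "Apply reinforce step"

(Accumulating gradients micro-batch by micro-batch is what lets the ΔVRAM of the final step stay near zero: activations for each 8960-token chunk are freed before the next chunk runs.)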
[2025-08-20 08:56:38,329][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:56:38,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:56:38,337][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:56:40,809][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:56:40,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:56:40,816][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:56:40,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:56:40,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:56:41,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:41,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:42,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:43,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:44,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:45,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:45,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:46,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:47,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:48,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:49,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:49,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:50,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:51,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:52,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:53,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:53,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:54,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:55,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:56,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:57,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:58,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:59,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:56:59,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:00,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:01,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:02,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:03,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:03,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:04,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:05,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:06,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:07,764][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:57:08,720][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:57:08,721][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:57:09,959][__main__][INFO] - Iteration 48 took 55s (38.43% Gen, 61.57% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 35m 57s. Estimated total time: 15h 22m 36s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 18s.
[2025-08-20 08:57:09,961][__main__][INFO] - Starting iteration 48.
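(Editor's note: the two "Saved … optimizer state to …" records show that both optimizers are checkpointed after every update, so a preempted run can resume without losing Adam moment estimates. This is the standard torch.save/torch.load state_dict pattern; the sketch below assumes that pattern, with the file names taken from the log and the function names invented for illustration.)

    import os
    import torch

    def save_optimizer_states(policy_opt: torch.optim.Optimizer,
                              critic_opt: torch.optim.Optimizer,
                              out_dir: str) -> None:
        """Persist both optimizer states next to the trainer's other artifacts."""
        torch.save(policy_opt.state_dict(), os.path.join(out_dir, "policy_optimizer_state.pt"))
        torch.save(critic_opt.state_dict(), os.path.join(out_dir, "critic_optimizer_state.pt"))

    def load_optimizer_states(policy_opt: torch.optim.Optimizer,
                              critic_opt: torch.optim.Optimizer,
                              out_dir: str) -> None:
        """Restore both optimizer states when resuming a run."""
        policy_opt.load_state_dict(torch.load(os.path.join(out_dir, "policy_optimizer_state.pt")))
        critic_opt.load_state_dict(torch.load(os.path.join(out_dir, "critic_optimizer_state.pt")))

(The sub-second gap between the two save records suggests the state dicts are small relative to the adapter-only trainable parameters, which fits the LoRA-style adapters this run trains.)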
[2025-08-20 08:57:34,428][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:57:34,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:57:34,436][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:57:36,881][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:57:36,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:57:36,889][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:57:36,891][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:57:36,891][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:57:37,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:37,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:38,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:39,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:40,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:41,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:41,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:42,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:43,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:44,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:45,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:46,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:47,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:48,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:49,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:49,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:50,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:51,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:52,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:53,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:53,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:54,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:55,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:56,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:57,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:58,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:59,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:57:59,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:00,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:01,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:02,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:03,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:04,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:58:05,512][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:58:05,513][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:58:06,856][__main__][INFO] - Iteration 49 took 56s (38.69% Gen, 61.31% Train). Generation: 22s, Training: 34s. Estimated remaining time: 15h 0m 38s. Estimated total time: 15h 48m 14s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 49s, 500 more iterations: 7h 54m 7s.
[2025-08-20 08:58:06,857][__main__][INFO] - Starting iteration 49.
[2025-08-20 08:58:31,730][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:58:31,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:58:31,738][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:58:34,199][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:58:34,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:58:34,207][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:58:34,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:58:34,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:58:34,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:35,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:36,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:36,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:37,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:38,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:39,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:40,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:40,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:41,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:42,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:43,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:44,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:44,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:45,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:46,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:47,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:48,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:48,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:49,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:50,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:51,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:51,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:52,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:53,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:54,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:55,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:56,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:57,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:57,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:58,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:58:59,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:01,131][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:59:02,106][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:59:02,107][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 08:59:03,460][__main__][INFO] - Iteration 50 took 56s (39.59% Gen, 60.41% Train). Generation: 22s, Training: 34s. Estimated remaining time: 14h 54m 49s. Estimated total time: 15h 43m 22s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 20s, 500 more iterations: 7h 51m 41s.
[2025-08-20 08:59:03,462][__main__][INFO] - Starting iteration 50.
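(Editor's note: the per-iteration summary lines extrapolate from the average iteration time so far. At iteration 50, "10 more iterations: 9m 26s" is 566 s, i.e. about 56.6 s per iteration, and "Estimated total time: 15h 43m 22s" divided by 56.6 s comes out almost exactly at 1,000, so the run appears to be planned for roughly 1,000 iterations. A sketch of that arithmetic, assuming a plain running average; the real estimator may weight iterations differently.)

    def format_hms(seconds: float) -> str:
        """Render seconds as the log's 'Xh Ym Zs' style."""
        s = int(round(seconds))
        h, s = divmod(s, 3600)
        m, s = divmod(s, 60)
        return f"{h}h {m}m {s}s" if h else (f"{m}m {s}s" if m else f"{s}s")

    def eta_report(elapsed_s: float, done: int, total: int) -> str:
        """Extrapolate remaining/total time from the average iteration so far."""
        avg = elapsed_s / max(done, 1)
        remaining = avg * (total - done)
        extras = ", ".join(
            f"{n} more iterations: {format_hms(avg * n)}" for n in (10, 100, 500)
        )
        return (f"Estimated remaining time: {format_hms(remaining)}. "
                f"Estimated total time: {format_hms(elapsed_s + remaining)}. "
                f"Time estimates for {extras}.")

    # e.g. eta_report(elapsed_s=50 * 56.6, done=50, total=1000)

(The slight swings between iterations, e.g. the jump after the 59 s iteration 51 below, fit a short-horizon or running average that reacts to each new measurement.)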
[2025-08-20 08:59:27,816][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:59:27,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:59:27,824][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:59:30,323][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:59:30,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:59:30,330][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 08:59:30,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 08:59:30,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 08:59:30,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:31,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:32,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:33,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:33,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:34,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:35,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:36,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:36,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:37,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:38,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:39,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:40,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:40,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:41,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:42,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:43,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:44,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:44,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:45,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:46,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:47,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:48,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:48,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:50,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:50,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:51,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:52,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:53,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:54,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:54,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:55,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 08:59:57,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 08:59:58,240][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 08:59:58,241][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:00:02,757][__main__][INFO] - Iteration 51 took 59s (36.87% Gen, 58.90% Train). Generation: 21s, Training: 34s. Estimated remaining time: 15h 38m 43s. Estimated total time: 16h 28m 15s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 49s, 500 more iterations: 8h 14m 7s.
[2025-08-20 09:00:02,759][__main__][INFO] - Starting iteration 51.
[2025-08-20 09:00:26,816][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:00:26,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:00:26,824][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:00:29,285][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:00:29,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:00:29,293][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:00:29,296][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:00:29,296][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:00:29,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:30,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:31,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:31,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:32,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:33,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:34,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:35,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:35,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:36,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:37,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:38,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:39,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:39,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:40,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:41,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:42,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:43,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:44,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:45,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:45,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:46,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:47,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:48,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:49,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:49,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:50,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:51,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:52,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:53,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:54,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:54,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:00:56,536][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:00:57,492][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:00:57,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:00:58,808][__main__][INFO] - Iteration 52 took 56s (38.56% Gen, 61.43% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 43m 40s. Estimated total time: 15h 34m 8s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 24s, 500 more iterations: 7h 47m 4s.
[2025-08-20 09:00:58,809][__main__][INFO] - Starting iteration 52.
[2025-08-20 09:01:26,815][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:01:26,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:01:26,822][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:01:29,274][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:01:29,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:01:29,281][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:01:29,283][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:01:29,284][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:01:29,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:30,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:31,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:31,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:32,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:33,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:34,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:35,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:35,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:36,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:37,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:38,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:39,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:39,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:40,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:41,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:42,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:43,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:43,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:44,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:45,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:46,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:47,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:48,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:49,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:49,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:50,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:51,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:52,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:53,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:53,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:54,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:01:56,162][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:01:59,527][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:01:59,532][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:02:00,793][__main__][INFO] - Iteration 53 took 1m 1s (41.19% Gen, 58.81% Train). Generation: 25s, Training: 36s. Estimated remaining time: 16h 21m 34s. Estimated total time: 17h 13m 3s. Time estimates for 10 more iterations: 10m 19s, 100 more iterations: 1h 43m 18s, 500 more iterations: 8h 36m 31s.
[2025-08-20 09:02:00,795][__main__][INFO] - Starting iteration 53.
[2025-08-20 09:02:24,238][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:02:24,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:02:24,246][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:02:26,710][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:02:26,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:02:26,717][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:02:26,719][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:02:26,720][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:02:27,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:27,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:28,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:29,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:30,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:30,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:31,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:32,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:33,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:34,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:34,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:35,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:36,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:37,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:38,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:38,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:39,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:40,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:41,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:42,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:42,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:43,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:44,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:45,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:46,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:46,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:48,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:48,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:49,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:50,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:51,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:52,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:02:53,640][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:02:54,631][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:02:54,633][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:02:55,955][__main__][INFO] - Iteration 54 took 55s (38.03% Gen, 61.96% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 26m 54s. Estimated total time: 15h 19m 19s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 39s.
[2025-08-20 09:02:55,957][__main__][INFO] - Starting iteration 54.
[2025-08-20 09:03:19,342][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:03:19,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:03:19,349][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:03:21,803][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:03:21,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:03:21,811][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:03:21,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:03:21,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:03:22,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    (identical record repeated 31 more times, 09:03:22,900 through 09:03:47,156)
[2025-08-20 09:03:48,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:03:49,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:03:49,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:03:51,070][__main__][INFO] - Iteration 55 took 55s (38.00% Gen, 62.00% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 25m 12s. Estimated total time: 15h 18m 32s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 51s, 500 more iterations: 7h 39m 16s.
[2025-08-20 09:03:51,071][__main__][INFO] - Starting iteration 55.
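
The "Sharing advantage alignment data" / "Receiving advantage packets" pair shows the two trainers exchanging advantage tensors every iteration, which advantage-alignment-style objectives need because they couple each agent's policy update to the co-player's advantages as well as its own. The transport below, a pair of in-process queues, is purely illustrative; how mllm actually moves the packets is not visible in the log.

    # Illustrative stand-in for the per-iteration advantage exchange.
    import queue
    import torch

    class AdvantageChannel:
        def __init__(self):
            self.inbox = queue.Queue()

        def share(self, peer, advantages):
            # detach and move off-device before handing to the other trainer
            peer.inbox.put(advantages.detach().cpu())

        def receive(self):
            # blocks until the co-player's packet arrives
            return self.inbox.get()

    agent, partner = AdvantageChannel(), AdvantageChannel()
    agent.share(partner, torch.randn(8))   # "Sharing advantage alignment data."
    packet = partner.receive()             # "Receiving advantage packets."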
[2025-08-20 09:04:14,836][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:04:14,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:04:14,844][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:04:17,300][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:04:17,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:04:17,308][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:04:17,310][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:04:17,310][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:04:17,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    (identical record repeated 31 more times, 09:04:18,402 through 09:04:43,830)
[2025-08-20 09:04:45,454][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:04:46,433][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:04:46,434][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:04:47,751][__main__][INFO] - Iteration 56 took 56s (37.58% Gen, 62.41% Train). Generation: 21s, Training: 35s. Estimated remaining time: 14h 50m 23s. Estimated total time: 15h 44m 39s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 27s, 500 more iterations: 7h 52m 19s.
[2025-08-20 09:04:47,753][__main__][INFO] - Starting iteration 56.
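
Every iteration accumulates exactly 32 microbatches of 8,960 tokens at a steady ~0.8 s each before the reinforce step, i.e. 32 × 8960 = 286,720 trained tokens per update. A sketch of token-count gradient accumulation in that shape follows; the loss form is an assumption, only the 8960-token microbatch size and the count of 32 come from the log.

    import torch

    MICROBATCHES, TOKENS_PER_MB = 32, 8960       # 286,720 tokens per update
    TOTAL_TOKENS = MICROBATCHES * TOKENS_PER_MB

    def accumulate_policy_loss(logprobs, advantages):
        # per-token REINFORCE term, normalized by the full update's token count
        # so that the 32 backward() calls sum to a correctly scaled gradient
        loss = -(advantages * logprobs).sum() / TOTAL_TOKENS
        loss.backward()
        print(f"Accumulated the policy gradient loss for {logprobs.numel()} tokens.")

    logprobs = torch.randn(TOKENS_PER_MB, requires_grad=True)
    advantages = torch.randn(TOKENS_PER_MB)
    for _ in range(MICROBATCHES):
        accumulate_policy_loss(logprobs, advantages)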
[2025-08-20 09:05:11,130][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:05:11,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:05:11,137][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:05:13,561][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:05:13,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:05:13,568][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:05:13,570][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:05:13,571][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:05:13,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    (identical record repeated 31 more times, 09:05:14,661 through 09:05:39,043)
[2025-08-20 09:05:40,607][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:05:41,723][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:05:41,725][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:05:43,102][__main__][INFO] - Iteration 57 took 55s (37.85% Gen, 62.14% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 27m 17s. Estimated total time: 15h 22m 29s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 14s.
[2025-08-20 09:05:43,104][__main__][INFO] - Starting iteration 57.
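
The recurring "For task: … , ΔVRAM Allocated: …, ΔVRAM Reserved: …, ΔTime: …, Percentage of VRAM taken: …" records look like the output of a small profiling wrapper around each task. One way such a wrapper could be written is sketched below; the class, the logger name and formatting, and the choice of reserved memory for the percentage are all assumptions, not the mllm implementation.

    import logging
    import time
    import torch

    log = logging.getLogger("mllm.training.trainer_common")

    class TrackTask:
        """Log ΔVRAM, elapsed time, and overall VRAM usage for a named task."""

        def __init__(self, name):
            self.name = name

        def __enter__(self):
            self.t0 = time.monotonic()
            self.alloc0 = torch.cuda.memory_allocated()
            self.res0 = torch.cuda.memory_reserved()
            return self

        def __exit__(self, *exc):
            gib = 1024 ** 3
            elapsed = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - self.t0))
            total = torch.cuda.get_device_properties(0).total_memory
            log.info(
                "For task: %s , ΔVRAM Allocated: %s GB, ΔVRAM Reserved: %s GB, "
                "ΔTime: %s, Percentage of VRAM taken: %s%%,",
                self.name,
                (torch.cuda.memory_allocated() - self.alloc0) / gib,
                (torch.cuda.memory_reserved() - self.res0) / gib,
                elapsed,
                100 * torch.cuda.memory_reserved() / total,
            )

    # with TrackTask("Apply reinforce step"):
    #     optimizer.step()

Note that the percentage is pinned at 69.9157…% across every iteration shown here, consistent with the allocator's reserved pool having stopped growing once early iterations warmed it up.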
[2025-08-20 09:06:07,652][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:06:07,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:06:07,660][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:06:10,109][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:06:10,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:06:10,117][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:06:10,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:06:10,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:06:10,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    (identical record repeated 31 more times, 09:06:11,207 through 09:06:35,520)
[2025-08-20 09:06:37,121][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:06:38,112][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:06:38,113][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:06:39,458][__main__][INFO] - Iteration 58 took 56s (39.20% Gen, 60.80% Train). Generation: 22s, Training: 34s. Estimated remaining time: 14h 43m 5s. Estimated total time: 15h 39m 13s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 55s, 500 more iterations: 7h 49m 36s.
[2025-08-20 09:06:39,460][__main__][INFO] - Starting iteration 58.
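
"Apply reinforce step" with ΔVRAM Allocated ≈ 2.29e-05 GB, which is exactly 24 KiB, is what a steady-state optimizer step looks like: the optimizer state already exists, so the step mostly updates tensors in place. A hedged sketch of the step itself; the gradient clipping and the AdamW choice are assumptions, not read from the log.

    import torch

    def apply_reinforce_step(policy, optimizer, max_grad_norm=1.0):
        # gradients were accumulated over the 32 microbatches above
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)  # free grads for the next iteration

    policy = torch.nn.Linear(8, 8)
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)
    policy(torch.randn(2, 8)).sum().backward()
    apply_reinforce_step(policy, optimizer)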
[2025-08-20 09:07:04,486][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:07:04,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:07:04,494][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:07:06,964][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:07:06,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:07:06,972][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:07:06,974][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:07:06,975][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:07:07,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    (identical record repeated 31 more times, 09:07:08,064 through 09:07:32,384)
[2025-08-20 09:07:33,982][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:07:34,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:07:34,948][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:07:36,283][__main__][INFO] - Iteration 59 took 56s (39.70% Gen, 60.29% Train). Generation: 22s, Training: 34s. Estimated remaining time: 14h 49m 58s. Estimated total time: 15h 47m 3s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 42s, 500 more iterations: 7h 53m 31s.
[2025-08-20 09:07:36,849][__main__][INFO] - Starting iteration 59.
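
Both optimizer states are re-saved after every single iteration (~1 s of each ~55 s cycle), which keeps the run resumable from any point at the cost of constant checkpoint I/O. A sketch of the save step, assuming plain torch.save of the optimizers' state_dicts; the paths are copied from the log, and the helper name is invented.

    import torch

    def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir):
        torch.save(policy_optimizer.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
        torch.save(critic_optimizer.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

    # save_optimizer_states(policy_optimizer, critic_optimizer,
    #     "/network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/"
    #     "ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer")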
[2025-08-20 09:08:00,858][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:00,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:00,865][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:03,328][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:03,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:03,336][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:03,338][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:08:03,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:08:03,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    (identical record repeated 31 more times, 09:08:04,428 through 09:08:29,125)
[2025-08-20 09:08:30,755][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:08:31,740][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:08:31,742][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:08:33,080][__main__][INFO] - Iteration 60 took 56s (38.32% Gen, 61.68% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 39m 8s. Estimated total time: 15h 37m 10s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 43s, 500 more iterations: 7h 48m 35s.
[2025-08-20 09:08:33,081][__main__][INFO] - Starting iteration 60.
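
The ETA lines are self-consistent with a simple average-duration extrapolation: after iteration 60, "10 more iterations: 9m 22s" implies ~56.2 s per iteration, and the 14h 39m 8s remaining-time estimate divided by that rate leaves ~938 iterations to go, so the run appears to target roughly 1,000 iterations in total. A sketch of that arithmetic:

    def fmt(seconds):
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s"

    avg = (9 * 60 + 22) / 10               # 56.2 s, from the 10-iteration estimate
    remaining = 14 * 3600 + 39 * 60 + 8    # 52,748 s, the logged remaining time
    print(remaining / avg)                 # ≈ 938.6 iterations left after iteration 60
    print(fmt(avg * 100), fmt(avg * 500))  # "1h 33m 40s" "7h 48m 20s", ≈ the logged projections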
[2025-08-20 09:08:56,807][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:56,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:56,815][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:59,274][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:59,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:59,282][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:08:59,284][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:08:59,285][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:08:59,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    (identical record repeated 31 more times, 09:09:00,373 through 09:09:24,707)
[2025-08-20 09:09:26,314][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:09:27,322][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:09:27,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:09:28,917][__main__][INFO] - Iteration 61 took 55s (38.10% Gen, 61.90% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 31m 37s. Estimated total time: 15h 30m 35s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 3s, 500 more iterations: 7h 45m 17s.
[2025-08-20 09:09:28,919][__main__][INFO] - Starting iteration 61.
[2025-08-20 09:09:52,588][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:09:52,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:09:52,596][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:09:55,065][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:09:55,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:09:55,072][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:09:55,074][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:09:55,075][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:09:55,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    (identical record repeated 31 more times, 09:09:56,166 through 09:10:20,563)
[2025-08-20 09:10:22,184][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:10:23,146][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:10:23,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:10:24,453][__main__][INFO] - Iteration 62 took 55s (38.20% Gen, 61.79% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 25m 39s. Estimated total time: 15h 25m 33s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 46s.
[2025-08-20 09:10:24,454][__main__][INFO] - Starting iteration 62.
[2025-08-20 09:10:47,650][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:10:47,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:10:47,658][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:10:50,136][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:10:50,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:10:50,143][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:10:50,146][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:10:50,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:10:50,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    (identical record repeated 31 more times, 09:10:51,237 through 09:11:15,532)
[2025-08-20 09:11:17,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:11:18,114][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:11:18,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:11:19,427][__main__][INFO] - Iteration 63 took 54s (37.68% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 15m 23s. Estimated total time: 15h 16m 12s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 6s.
[2025-08-20 09:11:19,428][__main__][INFO] - Starting iteration 63.
[2025-08-20 09:11:43,549][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:11:43,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:11:43,557][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:11:46,024][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:11:46,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:11:46,032][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:11:46,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:11:46,035][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:11:46,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:47,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:47,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:48,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:49,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:50,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:51,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:51,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:52,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:53,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:54,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:55,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:55,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:56,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:57,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:58,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:59,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:11:59,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:00,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:01,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:02,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:03,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:04,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:04,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:05,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:06,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:07,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:08,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:08,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:09,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:10,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:11,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:12,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:12:13,973][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:12:13,975][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:12:15,334][__main__][INFO] - Iteration 64 took 55s (38.74% Gen, 61.26% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 30m 1s. Estimated total time: 15h 31m 45s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 10s, 500 more iterations: 7h 45m 52s.
[2025-08-20 09:12:15,335][__main__][INFO] - Starting iteration 64.
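Editor's note: each iteration accumulates the policy gradient loss over 32 microbatches of 8,960 tokens (32 × 8,960 ≈ 287k tokens) before the single "Apply reinforce step". A hedged sketch of that loop, assuming an HF-style causal LM and precomputed per-token advantages; the field names (`input_ids`, `action_mask`, `advantages`) and the token count being the microbatch's total size are illustrative assumptions, not the project's API:

```python
import torch

def reinforce_update(policy, optimizer, microbatches, logger):
    """Accumulate -A_t * log pi(a_t|s_t) over fixed-size microbatches,
    then take one optimizer step (the 'Apply reinforce step' record)."""
    optimizer.zero_grad(set_to_none=True)
    n = len(microbatches)                                 # 32 in the run above
    for mb in microbatches:
        out = policy(input_ids=mb["input_ids"])           # HF-style causal LM
        logp = torch.log_softmax(out.logits[:, :-1], dim=-1)
        tok_logp = logp.gather(-1, mb["input_ids"][:, 1:].unsqueeze(-1)).squeeze(-1)
        mask = mb["action_mask"][:, 1:]                   # 1 on generated (action) tokens
        loss = -(mb["advantages"][:, 1:] * tok_logp * mask).sum() / mask.sum()
        (loss / n).backward()                             # scale so gradients average
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    mb["input_ids"].numel())
    optimizer.step()
```

The steady ~0.79 s per microbatch visible in the timestamps is one forward/backward pass; the step itself accounts for the last second or two of each 26-27 s training phase.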
[2025-08-20 09:12:38,599][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:12:38,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:12:38,607][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:12:41,053][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:12:41,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:12:41,061][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:12:41,063][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:12:41,064][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:12:41,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:42,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:42,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:43,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:44,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:45,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:46,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:46,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:47,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:48,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:49,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:50,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:50,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:51,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:52,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:53,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:54,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:55,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:56,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:56,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:57,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:58,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:12:59,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:00,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:00,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:01,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:02,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:03,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:04,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:04,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:05,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:06,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:08,047][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:13:08,979][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:13:08,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:13:10,288][__main__][INFO] - Iteration 65 took 54s (37.90% Gen, 62.09% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 13m 13s. Estimated total time: 15h 15m 52s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 56s.
[2025-08-20 09:13:10,290][__main__][INFO] - Starting iteration 65.
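Editor's note: "Sharing advantage alignment data" / "Receiving advantage packets" indicate the two agents exchange advantage estimates before each policy update. The exact objective lives in mllm.training.trainer_ad_align and is not printed here; the following is only a hedged sketch of the advantage-alignment idea (a discounted sum of an agent's own past advantages gating the opponent's current advantage), with `beta` and `gamma` invented for illustration:

```python
import torch

def aligned_advantages(own_adv: torch.Tensor,
                       opp_adv: torch.Tensor,
                       beta: float = 1.0,
                       gamma: float = 0.96) -> torch.Tensor:
    """Return A_t^own + beta * (sum_{k<t} gamma^{t-k} A_k^own) * A_t^opp.

    Sketch of the paper-style alignment term over a (..., T) advantage
    sequence; not necessarily the exact formula used by trainer_ad_align."""
    T = own_adv.shape[-1]
    past = torch.zeros_like(own_adv)
    running = torch.zeros_like(own_adv[..., 0])
    for t in range(T):
        past[..., t] = running                 # discounted sum of strictly past own advantages
        running = gamma * running + own_adv[..., t]
    return own_adv + beta * past * opp_adv
```

The shaped advantages would then feed the REINFORCE loss in place of the raw ones, which is consistent with the packet exchange happening immediately before loss accumulation begins.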
[2025-08-20 09:13:33,805][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:13:33,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:13:33,813][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:13:36,269][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:13:36,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:13:36,277][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:13:36,279][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:13:36,280][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:13:36,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:37,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:38,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:38,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:39,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:40,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:41,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:42,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:42,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:43,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:44,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:45,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:46,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:46,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:47,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:48,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:49,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:50,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:50,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:51,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:52,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:53,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:54,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:55,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:56,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:56,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:57,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:58,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:13:59,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:00,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:00,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:02,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:04,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:14:04,995][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:14:04,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:14:06,351][__main__][INFO] - Iteration 66 took 56s (37.58% Gen, 62.42% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 30m 45s. Estimated total time: 15h 34m 20s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 26s, 500 more iterations: 7h 47m 10s.
[2025-08-20 09:14:06,353][__main__][INFO] - Starting iteration 66.
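Editor's note: after every update the trainer persists both optimizer states (about one second per iteration for the two torch.save calls, judging by the timestamps), so a preempted job can resume mid-run without losing Adam moments. A minimal sketch; the directory layout is taken from the paths above, while the function name and `logger` wiring are assumed:

```python
import os

import torch

def save_optimizer_states(policy_opt, critic_opt, trainer_dir, logger):
    """Write both optimizer state dicts after each reinforce step."""
    os.makedirs(trainer_dir, exist_ok=True)
    policy_path = os.path.join(trainer_dir, "policy_optimizer_state.pt")
    critic_path = os.path.join(trainer_dir, "critic_optimizer_state.pt")
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)
```

Overwriting the same two files each iteration keeps only the latest state; adapter weights would presumably be checkpointed separately.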
[2025-08-20 09:14:30,010][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:14:30,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:14:30,018][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:14:32,496][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:14:32,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:14:32,504][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:14:32,506][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:14:32,507][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:14:32,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:33,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:34,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:35,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:35,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:36,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:37,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:38,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:39,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:39,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:40,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:41,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:42,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:43,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:43,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:44,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:45,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:46,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:47,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:47,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:49,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:49,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:50,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:51,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:52,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:53,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:53,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:54,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:55,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:56,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:57,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:57,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:14:59,445][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:15:00,404][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:15:00,406][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:15:01,719][__main__][INFO] - Iteration 67 took 55s (38.33% Gen, 61.67% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 18m 15s. Estimated total time: 15h 22m 45s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 16s, 500 more iterations: 7h 41m 22s.
[2025-08-20 09:15:01,721][__main__][INFO] - Starting iteration 67.
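Editor's note: the per-iteration summaries are simple extrapolations. A 55 s iteration gives "10 more iterations" ≈ 550 s ≈ 9 m 10 s (the 9 m 13 s printed for iteration 67 suggests fractional seconds are kept internally), and the ~15 h "estimated total" figures are consistent with a 1000-iteration run at roughly 55 s per iteration. A sketch of that bookkeeping; the function names and the use of a running average are assumptions:

```python
def format_hms(seconds: float) -> str:
    """Render a duration as e.g. '1h 31m 35s' or '9m 10s'."""
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    return f"{h}h {m}m {sec}s" if h else (f"{m}m {sec}s" if m else f"{sec}s")

def iteration_summary(i: int, total: int, gen_s: float, train_s: float,
                      avg_iter_s: float) -> str:
    """Build the 'Iteration N took ...' line from this iteration's timings
    and the running average over all completed iterations."""
    iter_s = gen_s + train_s
    gen_pct = 100.0 * gen_s / iter_s
    return (f"Iteration {i} took {format_hms(iter_s)} "
            f"({gen_pct:.2f}% Gen, {100.0 - gen_pct:.2f}% Train). "
            f"Estimated remaining time: {format_hms((total - i) * avg_iter_s)}. "
            f"Time estimates for 10 more iterations: {format_hms(10 * avg_iter_s)}, "
            f"100 more iterations: {format_hms(100 * avg_iter_s)}.")
```

The steady ~37-39% Gen / ~61-63% Train split shows the run is training-bound: roughly 20 s of rollout generation against 34 s of advantage computation plus gradient work per iteration.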
[2025-08-20 09:15:24,939][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:15:24,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:15:24,946][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:15:27,412][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:15:27,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:15:27,420][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:15:27,422][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:15:27,423][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:15:27,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:28,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:29,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:30,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:30,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:31,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:32,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:33,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:34,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:34,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:35,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:36,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:37,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:38,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:38,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:39,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:40,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:41,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:41,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:42,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:44,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:44,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:45,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:46,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:47,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:48,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:48,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:49,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:50,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:51,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:52,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:52,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:15:54,414][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:15:55,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:15:55,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:15:56,751][__main__][INFO] - Iteration 68 took 55s (37.73% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 11m 43s. Estimated total time: 15h 17m 9s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 42s, 500 more iterations: 7h 38m 34s.
[2025-08-20 09:15:56,752][__main__][INFO] - Starting iteration 68.
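Editor's note: "Create alternative trajectory batch" and "Compute alternative advantage estimates" run twice per iteration and report zero VRAM/time deltas, consistent with no-grad re-scoring of already-generated trajectories under counterfactual actions. A hedged sketch of such a counterfactual critic pass; the action-swapping scheme and the critic returning a scalar value per sequence are assumptions:

```python
import torch

@torch.no_grad()
def alternative_advantage_estimates(critic, input_ids, alt_action_ids):
    """Score each trajectory with its final action swapped for an alternative
    one and difference the critic values: A(s, a') ~ V(alt) - V(orig)."""
    v_orig = critic(input_ids).squeeze(-1)   # value of the trajectory as played
    alt = input_ids.clone()
    alt[:, -1] = alt_action_ids              # counterfactual final action token
    v_alt = critic(alt).squeeze(-1)
    return v_alt - v_orig
```

These alternative estimates are what get bundled into the "advantage packets" exchanged between the agent and critic adapters initialized at startup.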
[2025-08-20 09:16:19,880][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:16:19,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:16:19,934][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:16:22,384][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:16:22,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:16:22,392][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:16:22,394][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:16:22,395][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:16:22,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:23,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:24,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:25,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:25,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:26,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:27,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:28,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:29,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:29,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:30,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:31,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:32,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:33,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:33,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:34,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:35,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:36,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:36,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:38,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:39,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:39,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:40,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:41,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:42,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:43,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:43,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:44,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:45,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:46,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:47,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:47,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:16:49,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:16:50,388][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:16:50,389][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:16:51,653][__main__][INFO] - Iteration 69 took 54s (37.67% Gen, 62.32% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 8m 39s. Estimated total time: 15h 15m 0s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 30s.
[2025-08-20 09:16:51,655][__main__][INFO] - Starting iteration 69.
[2025-08-20 09:17:14,900][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:17:14,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:17:14,908][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:17:17,379][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:17:17,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:17:17,387][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:17:17,389][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:17:17,389][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:17:17,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:18,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:19,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:20,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:20,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:21,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:22,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:23,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:24,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:24,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:25,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:26,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:27,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:27,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:28,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:29,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:30,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:31,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:31,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:32,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:33,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:34,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:35,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:35,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:36,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:37,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:38,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:39,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:40,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:41,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:41,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:42,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:17:44,310][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:17:45,333][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:17:45,335][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:17:46,679][__main__][INFO] - Iteration 70 took 55s (37.74% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 9m 47s. Estimated total time: 15h 17m 3s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 42s, 500 more iterations: 7h 38m 31s.
[2025-08-20 09:17:46,680][__main__][INFO] - Starting iteration 70.
[2025-08-20 09:18:09,872][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:18:09,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:18:09,880][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:18:12,351][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:18:12,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:18:12,359][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:18:12,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:18:12,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:18:12,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:13,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:14,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:15,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:15,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:16,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:17,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:18,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:19,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:19,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:20,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:21,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:22,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:22,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:23,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:24,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:25,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:26,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:26,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:27,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:29,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:29,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:30,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:31,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:32,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:33,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:33,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:34,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:35,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:36,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:37,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:37,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:18:39,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:18:40,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:18:40,415][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:18:42,078][__main__][INFO] - Iteration 71 took 55s (37.42% Gen, 62.58% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 15m 5s. Estimated total time: 15h 23m 16s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 38s.
[2025-08-20 09:18:42,079][__main__][INFO] - Starting iteration 71.
[2025-08-20 09:19:05,864][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:19:05,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:19:05,871][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:19:08,308][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:19:08,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:19:08,316][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:19:08,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:19:08,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:19:08,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:09,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:10,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:10,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:11,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:12,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:13,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:14,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:14,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:15,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:16,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:17,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:18,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:18,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:19,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:20,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:21,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:22,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:22,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:23,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:24,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:25,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:26,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:27,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:28,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:28,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:29,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:30,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:31,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:32,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:32,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:33,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:19:35,296][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:19:36,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:19:36,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:19:37,738][__main__][INFO] - Iteration 72 took 55s (38.36% Gen, 61.64% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 18m 31s. Estimated total time: 15h 27m 38s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 45s, 500 more iterations: 7h 43m 49s.
[2025-08-20 09:19:37,740][__main__][INFO] - Starting iteration 72.
[2025-08-20 09:20:00,955][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:00,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:00,963][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:03,427][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:03,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:03,434][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:03,437][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:20:03,437][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:20:03,735 → 09:20:28,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries in this window)
[2025-08-20 09:20:30,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:20:31,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:20:31,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:20:32,819][__main__][INFO] - Iteration 73 took 55s (37.65% Gen, 62.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 7m 56s. Estimated total time: 15h 17m 58s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 59s.
[2025-08-20 09:20:32,821][__main__][INFO] - Starting iteration 73.
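
Each "Accumulated the policy gradient loss for 8960 tokens." entry is one backward pass over a fixed-size chunk of rollout tokens; the 32 chunks per iteration sum their gradients in place ahead of a single optimizer update. A sketch of that loop, assuming a REINFORCE-style surrogate (the trainer's real loss is the advantage-alignment objective, which the log does not show):

    def accumulate_policy_gradient_loss(microbatches, logger):
        """One backward per chunk; gradients sum into .grad until the optimizer step."""
        for logprobs, advantages, mask in microbatches:
            n_tokens = int(mask.sum())
            # Per-token normalization keeps the gradient scale independent of
            # chunk size; the actual trainer may normalize over the full batch.
            loss = -(advantages * logprobs * mask).sum() / n_tokens
            loss.backward()
            logger.info("Accumulated the policy gradient loss for %d tokens.", n_tokens)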
[2025-08-20 09:20:56,013][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:56,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:56,021][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:58,468][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:58,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:58,476][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:20:58,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:20:58,479][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:20:58,777 → 09:21:23,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries in this window)
[2025-08-20 09:21:25,461][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:21:26,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:21:26,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:21:29,495][__main__][INFO] - Iteration 74 took 56s (36.61% Gen, 63.38% Train). Generation: 20s, Training: 35s. Estimated remaining time: 14h 33m 35s. Estimated total time: 15h 44m 33s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 27s, 500 more iterations: 7h 52m 16s.
[2025-08-20 09:21:29,497][__main__][INFO] - Starting iteration 74.
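
"Apply reinforce step" then turns those accumulated gradients into one parameter update. A plausible shape, assuming gradient clipping before the step and a grad reset after it (the clipping threshold is a guess):

    import torch

    def apply_reinforce_step(model, optimizer, max_grad_norm=1.0):
        """One optimizer update from the gradients accumulated over all 32 chunks."""
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)  # release .grad buffers between iterations

The tiny ΔVRAM Allocated on this task (2.288818359375e-05 GB, about 24 KiB) is consistent with optimizer state that was allocated in an earlier iteration and is now updated in place.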
[2025-08-20 09:21:52,487][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:21:52,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:21:52,495][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:21:54,941][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:21:54,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:21:54,949][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:21:54,951][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:21:54,951][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:21:55,250 → 09:22:20,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries in this window)
[2025-08-20 09:22:21,964][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:22:23,005][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:22:23,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:22:24,416][__main__][INFO] - Iteration 75 took 54s (37.44% Gen, 62.56% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 3m 25s. Estimated total time: 15h 15m 19s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 39s.
[2025-08-20 09:22:24,419][__main__][INFO] - Starting iteration 75.
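
Both optimizer states are re-saved after every single update, presumably so a preempted cluster job can resume without losing momentum terms. A minimal sketch of that checkpointing step; the function name is invented, the paths are the ones in the log:

    import torch

    def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir, logger):
        """Persist both optimizers' state dicts once per iteration."""
        policy_path = f"{out_dir}/policy_optimizer_state.pt"
        critic_path = f"{out_dir}/critic_optimizer_state.pt"
        torch.save(policy_optimizer.state_dict(), policy_path)
        logger.info("Saved main optimizer state to %s", policy_path)
        torch.save(critic_optimizer.state_dict(), critic_path)
        logger.info("Saved critic optimizer state to %s", critic_path)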
[2025-08-20 09:22:47,665][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:22:47,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:22:47,673][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:22:50,134][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:22:50,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:22:50,142][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:22:50,144][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:22:50,145][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:22:50,444 → 09:23:15,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries in this window)
[2025-08-20 09:23:17,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:23:18,148][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:23:18,150][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:23:19,429][__main__][INFO] - Iteration 76 took 55s (37.80% Gen, 62.20% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 4m 1s. Estimated total time: 15h 16m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 24s.
[2025-08-20 09:23:19,430][__main__][INFO] - Starting iteration 76.
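
The closing time estimates behave like straight-line extrapolation from the mean iteration time: at the ~55 s average seen in this stretch, 10 more iterations is 550 s = 9m 10s and 500 more is about 27,500 s ≈ 7h 38m, matching the figures above, while the ~15 h totals imply a configured run length of roughly 1,000 iterations (an inference; the total is never printed in this excerpt). A sketch of the bookkeeping:

    import datetime

    def eta_report(iteration_seconds, total_iterations=1000):
        """Project times from the mean duration of the completed iterations."""
        avg = sum(iteration_seconds) / len(iteration_seconds)
        remaining = total_iterations - len(iteration_seconds)
        fmt = lambda s: str(datetime.timedelta(seconds=round(s)))
        return {
            "remaining": fmt(avg * remaining),
            "total": fmt(avg * total_iterations),
            "10 more": fmt(avg * 10),
            "100 more": fmt(avg * 100),
            "500 more": fmt(avg * 500),
        }

    # eta_report([55.0] * 77)["10 more"] -> '0:09:10'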
[2025-08-20 09:23:45,655][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:23:45,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:23:45,663][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:23:48,086][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:23:48,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:23:48,094][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:23:48,096][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:23:48,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:23:48,394 → 09:24:13,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries in this window)
[2025-08-20 09:24:15,059][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:24:15,990][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:24:15,991][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:24:17,610][__main__][INFO] - Iteration 77 took 58s (40.90% Gen, 59.10% Train). Generation: 23s, Training: 34s. Estimated remaining time: 14h 55m 52s. Estimated total time: 16h 9m 39s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 57s, 500 more iterations: 8h 4m 49s.
[2025-08-20 09:24:17,612][__main__][INFO] - Starting iteration 77.
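
"Sharing advantage alignment data." followed at once by "Receiving advantage packets." reads as a symmetric hand-off: each agent's trainer publishes its own advantage estimates and consumes its opponent's before forming the alignment loss. A minimal in-process sketch; the packet contents and the transport are assumptions, and the real trainers may exchange richer structures across processes:

    import queue

    class AdvantageExchange:
        """Symmetric hand-off of per-step advantage tensors between two co-trained agents."""
        def __init__(self):
            self.inbox = {0: queue.Queue(), 1: queue.Queue()}

        def share(self, agent_id, advantages):
            # publish this agent's advantages into the *other* agent's inbox
            self.inbox[1 - agent_id].put(advantages.detach().cpu())

        def receive(self, agent_id):
            # block until the opponent's packet arrives
            return self.inbox[agent_id].get()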
[2025-08-20 09:24:41,071][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:24:41,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:24:41,079][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:24:43,538][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:24:43,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:24:43,546][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:24:43,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:24:43,548][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:24:43,849 → 09:25:08,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries in this window)
[2025-08-20 09:25:10,570][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:25:11,576][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:25:11,579][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:25:12,862][__main__][INFO] - Iteration 78 took 55s (38.05% Gen, 61.95% Train). Generation: 21s, Training: 34s. Estimated remaining time: 14h 6m 7s. Estimated total time: 15h 20m 49s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 4s, 500 more iterations: 7h 40m 24s.
[2025-08-20 09:25:12,863][__main__][INFO] - Starting iteration 78.
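
"Get advantages with critic gradient accumulation" suggests a single pass that both produces advantage estimates and accumulates the critic's value-regression gradients for its own later update. One plausible shape, using one-step TD advantages (the actual estimator, discount, and tensor shapes are not visible in the log):

    def advantages_with_critic_grad_accumulation(critic, states, rewards, gamma=0.99):
        """One-step TD advantages; the critic's squared-error loss backprops on the same pass."""
        values = critic(states).squeeze(-1)        # V(s_0) .. V(s_T), shape [T + 1]
        targets = rewards + gamma * values[1:].detach()
        advantages = targets - values[:-1]         # A_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        (advantages ** 2).mean().backward()        # critic grads accumulate until its own step
        return advantages.detach()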
[2025-08-20 09:25:35,914][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:25:35,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:25:35,922][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:25:38,375][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:25:38,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:25:38,383][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:25:38,385][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:25:38,386][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:25:38,684 → 09:26:03,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries in this window)
[2025-08-20 09:26:05,365][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:26:06,304][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:26:06,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:26:07,800][__main__][INFO] - Iteration 79 took 54s (37.54% Gen, 62.45% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 0m 0s. Estimated total time: 15h 15m 36s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 48s.
[2025-08-20 09:26:07,802][__main__][INFO] - Starting iteration 79.
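
For scale, the numbers above pin down this stretch of the run: 32 chunks of 8,960 tokens is 286,720 policy-gradient tokens per update, and at ~55 s per iteration that is roughly 5,200 trained tokens per second, generation time included:

    CHUNKS_PER_STEP = 32        # "Accumulated ..." entries per iteration above
    TOKENS_PER_CHUNK = 8960
    ITERATION_SECONDS = 55      # typical duration in this stretch

    tokens_per_update = CHUNKS_PER_STEP * TOKENS_PER_CHUNK       # 286,720
    tokens_per_second = tokens_per_update / ITERATION_SECONDS    # ~5,213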
[2025-08-20 09:26:30,910][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:26:30,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:26:30,918][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:26:33,347][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:26:33,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:26:33,355][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:26:33,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:26:33,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:26:33,656 → 09:26:58,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries in this window)
[2025-08-20 09:27:00,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:27:01,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:27:01,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:27:03,042][__main__][INFO] - Iteration 80 took 55s (37.42% Gen, 62.57% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 4m 6s. Estimated total time: 15h 20m 38s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 3s, 500 more iterations: 7h 40m 19s.
[2025-08-20 09:27:03,044][__main__][INFO] - Starting iteration 80.
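
"Create alternative trajectory batch" pairs each sampled trajectory with counterfactual variants that feed the "Compute alternative advantage estimates" task right after it. The log does not reveal how the alternatives are built; one illustrative construction is to splice actions from other rollouts in the same batch into otherwise unchanged trajectories:

    import torch

    def create_alternative_trajectory_batch(observations, actions, num_alternatives=4):
        """Illustrative only: counterfactual action sequences borrowed from batch peers."""
        batch_size = actions.shape[0]
        alternatives = []
        for _ in range(num_alternatives):
            perm = torch.randperm(batch_size)
            alternatives.append({"observations": observations, "actions": actions[perm]})
        return alternatives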
[2025-08-20 09:27:26,068][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:27:26,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:27:26,076][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:27:28,524][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:27:28,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:27:28,532][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:27:28,535][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:27:28,535][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:27:28,835 → 09:27:53,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries in this window)
[2025-08-20 09:27:55,586][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:27:56,508][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:27:56,510][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:27:57,768][__main__][INFO] - Iteration 81 took 54s (37.63% Gen, 62.37% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 54m 37s. Estimated total time: 15h 12m 3s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 12s, 500 more iterations: 7h 36m 1s.
[2025-08-20 09:27:57,769][__main__][INFO] - Starting iteration 81.
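
Putting the pieces together, each ~55 s iteration in this log splits into a generation phase (~20 s of rollout sampling) and a training phase (~34 s spanning the ad-align tasks, the 32 accumulation chunks, the reinforce step, and checkpointing). In outline, with every helper name assumed rather than taken from the repo:

    import time

    def run_iteration(index, env, trainer, exchange, logger):
        t0 = time.monotonic()
        rollouts = env.generate_rollouts()                   # generation phase
        t_gen = time.monotonic() - t0

        advantages = trainer.estimate_advantages(rollouts)   # training phase
        exchange.share(trainer.agent_id, advantages)
        opponent_advantages = exchange.receive(trainer.agent_id)
        trainer.accumulate_policy_gradient(rollouts, advantages, opponent_advantages)
        trainer.apply_reinforce_step()
        trainer.save_optimizer_states()
        t_train = time.monotonic() - t0 - t_gen

        total = t_gen + t_train
        logger.info("Iteration %d took %.0fs (%.2f%% Gen, %.2f%% Train).",
                    index, total, 100 * t_gen / total, 100 * t_train / total)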
[2025-08-20 09:28:21,108][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:28:21,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:28:21,115][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:28:23,557][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:28:23,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:28:23,565][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:28:23,567][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:28:23,568][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:28:23,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:24,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:25,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:26,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:27,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:27,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:28,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:29,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:30,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:31,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:31,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:32,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:33,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:34,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:34,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:35,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:36,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:37,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:38,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:38,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:39,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:40,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:41,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:42,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:43,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:44,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:44,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:45,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:46,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:47,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:48,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:48,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:28:50,540][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:28:51,481][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:28:51,482][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:28:52,841][__main__][INFO] - Iteration 82 took 55s (37.94% Gen, 62.06% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 59m 29s. Estimated total time: 15h 17m 51s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 55s.
[2025-08-20 09:28:52,843][__main__][INFO] - Starting iteration 82.
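Each iteration logs 32 "Accumulated the policy gradient loss for 8960 tokens." entries, roughly 0.8 s apart, followed by a single "Apply reinforce step": gradients are accumulated over fixed-size token micro-batches and applied in one optimizer step. A minimal sketch of that pattern; the batch fields and the policy's log-prob interface are assumptions, not the trainer_common API:

    import torch

    def apply_reinforce_step(policy, optimizer, micro_batches, max_grad_norm=1.0):
        """Accumulate the REINFORCE loss over micro-batches, then step once."""
        optimizer.zero_grad(set_to_none=True)
        total_tokens = sum(int(mb["mask"].sum()) for mb in micro_batches)
        for mb in micro_batches:
            # Hypothetical interface: per-token log-probs of the sampled tokens.
            logprobs = policy.token_logprobs(mb["input_ids"])
            # REINFORCE: maximize advantage-weighted log-likelihood of own tokens.
            loss = -(mb["advantages"] * logprobs * mb["mask"]).sum() / total_tokens
            loss.backward()  # grads accumulate; one "Accumulated ..." line per call
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
        optimizer.step()

Accumulating before stepping keeps peak memory flat, which matches the near-zero ΔVRAM reported for the reinforce step after the first iteration.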
[2025-08-20 09:29:15,911][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:29:15,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:29:15,918][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:29:18,363][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:29:18,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:29:18,371][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:29:18,373][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:29:18,374][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:29:18,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:19,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:20,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:21,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:21,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:22,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:23,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:24,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:25,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:25,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:26,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:27,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:28,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:28,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:29,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:30,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:31,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:32,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:32,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:33,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:34,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:35,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:36,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:37,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:38,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:39,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:39,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:40,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:41,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:42,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:42,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:43,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:29:45,403][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:29:46,337][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:29:46,338][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:29:47,608][__main__][INFO] - Iteration 83 took 54s (37.69% Gen, 62.30% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 53m 28s. Estimated total time: 15h 12m 44s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 16s, 500 more iterations: 7h 36m 22s.
[2025-08-20 09:29:47,611][__main__][INFO] - Starting iteration 83.
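For scale, the per-iteration numbers multiply out directly; the counts below are taken from this log itself:

    tokens_per_micro_batch = 8_960   # from each "Accumulated ..." line
    micro_batches_per_step = 32      # "Accumulated ..." lines per iteration
    train_seconds = 34               # "Training: 34s" in the summaries

    tokens_per_step = tokens_per_micro_batch * micro_batches_per_step
    print(tokens_per_step)                  # 286720 tokens per reinforce step
    print(tokens_per_step / train_seconds)  # ~8433 trained tokens per second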
[2025-08-20 09:30:11,949][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:30:11,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:30:11,956][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:30:14,410][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:30:14,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:30:14,418][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:30:14,420][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:30:14,421][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:30:14,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:15,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:16,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:17,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:17,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:18,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:19,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:20,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:21,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:21,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:22,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:23,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:24,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:25,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:25,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:26,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:27,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:28,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:29,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:30,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:31,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:31,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:32,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:33,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:34,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:35,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:35,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:36,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:37,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:38,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:39,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:39,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:30:41,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:30:42,431][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:30:42,434][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:30:46,259][__main__][INFO] - Iteration 84 took 58s (37.34% Gen, 62.66% Train). Generation: 21s, Training: 36s. Estimated remaining time: 14h 57m 13s. Estimated total time: 16h 17m 28s. Time estimates for 10 more iterations: 9m 46s, 100 more iterations: 1h 37m 44s, 500 more iterations: 8h 8m 44s.
[2025-08-20 09:30:46,261][__main__][INFO] - Starting iteration 84.
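The two "Saved … optimizer state" lines per iteration match the standard torch.save pattern for making a run resumable; a minimal sketch, with the directory argument and function names as stand-ins (the .pt filenames are the ones in the log):

    import os
    import torch

    def save_optimizer_states(policy_optimizer, critic_optimizer, trainer_dir):
        os.makedirs(trainer_dir, exist_ok=True)
        torch.save(policy_optimizer.state_dict(),
                   os.path.join(trainer_dir, "policy_optimizer_state.pt"))
        torch.save(critic_optimizer.state_dict(),
                   os.path.join(trainer_dir, "critic_optimizer_state.pt"))

    def load_optimizer_states(policy_optimizer, critic_optimizer, trainer_dir):
        policy_optimizer.load_state_dict(
            torch.load(os.path.join(trainer_dir, "policy_optimizer_state.pt")))
        critic_optimizer.load_state_dict(
            torch.load(os.path.join(trainer_dir, "critic_optimizer_state.pt")))

Judging by the timestamps, saving both states costs about a second per iteration, in exchange for losing at most one iteration on a crash.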
[2025-08-20 09:31:10,846][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:31:10,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:31:10,854][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:31:13,290][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:31:13,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:31:13,298][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:31:13,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:31:13,301][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:31:13,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:14,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:15,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:15,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:16,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:17,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:18,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:19,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:19,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:20,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:21,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:22,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:23,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:23,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:24,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:25,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:26,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:27,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:27,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:28,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:29,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:30,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:31,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:32,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:33,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:33,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:34,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:35,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:36,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:37,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:37,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:38,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:31:40,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:31:41,258][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:31:41,260][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:31:42,532][__main__][INFO] - Iteration 85 took 56s (39.33% Gen, 60.66% Train). Generation: 22s, Training: 34s. Estimated remaining time: 14h 16m 39s. Estimated total time: 15h 37m 50s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 55s.
[2025-08-20 09:31:42,535][__main__][INFO] - Starting iteration 85.
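The trailing summary of every iteration ("Estimated remaining time …", "Time estimates for 10/100/500 more iterations …") is linear extrapolation from recent iteration time; the exact smoothing isn't visible in the log, but the way the total estimate jumps from ~15h 12m to 16h 17m right after the slow 58 s iteration 84 suggests something close to extrapolating from the latest iteration. A sketch under that assumption:

    from datetime import timedelta

    def eta_summary(last_iter_seconds, iterations_done, total_iterations):
        """Extrapolate remaining time from the latest iteration (one plausible scheme)."""
        eta = lambda n: timedelta(seconds=round(last_iter_seconds * n))
        remaining = total_iterations - iterations_done
        return (f"Estimated remaining time: {eta(remaining)}. "
                f"Time estimates for 10 more iterations: {eta(10)}, "
                f"100 more iterations: {eta(100)}, "
                f"500 more iterations: {eta(500)}.")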
[2025-08-20 09:32:05,700][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:32:05,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:32:05,708][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:32:08,165][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:32:08,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:32:08,172][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:32:08,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:32:08,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:32:08,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:09,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:10,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:10,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:11,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:12,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:13,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:14,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:14,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:15,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:16,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:17,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:18,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:18,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:19,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:20,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:21,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:22,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:22,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:23,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:24,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:25,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:26,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:27,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:28,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:28,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:29,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:30,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:31,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:32,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:32,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:33,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:32:35,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:32:36,249][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:32:36,251][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:32:37,552][__main__][INFO] - Iteration 86 took 55s (37.66% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 54m 50s. Estimated total time: 15h 16m 57s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 28s.
[2025-08-20 09:32:37,554][__main__][INFO] - Starting iteration 86.
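"Sharing advantage alignment data." and "Receiving advantage packets." pair up in every iteration: each agent's trainer presumably publishes its advantage estimates and pulls the other's before building the alignment loss. The actual transport layer is not visible in this log; a same-process stand-in using a queue (every name here is hypothetical, and the real trainers may use files, sockets, or torch.distributed instead):

    import queue

    import torch

    _advantage_bus = queue.Queue()  # stand-in transport between the two trainers

    def share_advantage_packet(advantages: torch.Tensor) -> None:
        # Corresponds to "Sharing advantage alignment data."
        _advantage_bus.put(advantages.detach().cpu())

    def receive_advantage_packets() -> torch.Tensor:
        # Corresponds to "Receiving advantage packets."
        return _advantage_bus.get()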
[2025-08-20 09:33:00,831][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:00,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:00,838][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:03,293][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:03,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:03,300][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:03,302][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:33:03,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:33:03,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:04,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:05,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:05,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:06,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:07,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:08,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:09,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:09,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:10,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:11,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:12,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:13,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:13,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:14,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:15,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:16,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:17,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:17,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:18,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:19,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:20,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:21,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:22,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:23,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:23,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:24,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:25,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:26,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:27,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:27,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:28,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:30,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:33:31,392][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:33:31,393][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:33:32,818][__main__][INFO] - Iteration 87 took 55s (37.69% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 58m 2s. Estimated total time: 15h 21m 3s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 31s.
[2025-08-20 09:33:32,820][__main__][INFO] - Starting iteration 87.
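"Get advantages with critic gradient accumulation" is consistent with computing advantages from a learned value head while accumulating the critic's regression gradients in the same pass. One common formulation, stated as an assumption rather than a description of trainer_ad_align, is the one-step TD advantage A_t = r_t + γ·V(s_{t+1}) − V(s_t), with the critic regressed toward the bootstrap target:

    import torch
    import torch.nn.functional as F

    def advantages_with_critic_grad(values, rewards, gamma=0.99):
        """values: critic outputs V(s_0..s_T) with requires_grad; rewards: r_0..r_{T-1}."""
        td_target = rewards + gamma * values[1:].detach()   # bootstrap target
        advantages = (td_target - values[:-1]).detach()     # A_t for the policy loss
        critic_loss = F.mse_loss(values[:-1], td_target)
        critic_loss.backward()   # accumulate critic gradients; optimizer steps later
        return advantages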
[2025-08-20 09:33:56,371][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:56,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:56,378][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:58,848][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:58,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:58,857][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:33:58,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:33:58,860][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:33:59,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:33:59,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:00,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:01,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:02,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:03,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:04,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:04,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:05,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:06,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:07,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:08,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:08,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:09,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:10,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:11,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:12,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:12,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:13,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:14,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:15,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:16,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:17,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:18,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:18,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:19,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:20,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:21,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:22,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:22,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:23,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:24,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:26,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:34:27,008][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:34:27,010][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:34:28,252][__main__][INFO] - Iteration 88 took 55s (38.07% Gen, 61.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 59m 54s. Estimated total time: 15h 23m 51s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 55s.
[2025-08-20 09:34:28,253][__main__][INFO] - Starting iteration 88.
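To get the timing series back out of a log in this format, a short regex is enough; this parser matches the "Iteration N took …" summary lines verbatim:

    import re

    _SUMMARY = re.compile(
        r"Iteration (\d+) took (\d+)s \(([\d.]+)% Gen, ([\d.]+)% Train\)")

    def iteration_timings(log_path):
        """Yield (iteration, seconds, gen_pct, train_pct) per summary line."""
        with open(log_path) as f:
            for line in f:
                m = _SUMMARY.search(line)
                if m:
                    n, secs, gen, train = m.groups()
                    yield int(n), int(secs), float(gen), float(train)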
[2025-08-20 09:34:51,482][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:34:51,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:34:51,490][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:34:53,957][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:34:53,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:34:53,965][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:34:53,967][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:34:53,968][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:34:54,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:55,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:55,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:56,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:57,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:58,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:59,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:34:59,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:00,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:01,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:02,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:02,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:03,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:04,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:05,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:06,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:06,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:07,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:08,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:09,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:10,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:10,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:12,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:12,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:13,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:14,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:17,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:18,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:18,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:19,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:20,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:21,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:22,912][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:35:23,834][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:35:23,835][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:35:25,369][__main__][INFO] - Iteration 89 took 57s (36.36% Gen, 63.63% Train). Generation: 20s, Training: 36s. Estimated remaining time: 14h 27m 1s. Estimated total time: 15h 51m 55s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 11s, 500 more iterations: 7h 55m 57s.
[2025-08-20 09:35:25,371][__main__][INFO] - Starting iteration 89.
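The three ad_align tasks run twice per iteration before the advantage exchange, which reads like one pass per side of the interaction (the log does not say which). In terms of the profiler helper sketched earlier, the per-iteration prelude would look roughly like this; every trainer method named here is hypothetical:

    def ad_align_prelude(trainer, rollout_passes):
        for rollouts in rollout_passes:  # the task triple appears twice per iteration
            with log_task("Create alternative trajectory batch"):
                alt_batch = trainer.create_alternative_batch(rollouts)
            with log_task("Get advantages with critic gradient accumulation"):
                advantages = trainer.critic_advantages(alt_batch)
            with log_task("Compute alternative advantage estimates"):
                trainer.alternative_advantage_estimates(alt_batch, advantages)
        trainer.share_alignment_data()     # "Sharing advantage alignment data."
        return trainer.receive_packets()   # "Receiving advantage packets."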
[2025-08-20 09:35:48,512][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:35:48,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:35:48,520][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:35:50,997][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:35:50,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:35:51,005][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:35:51,007][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:35:51,008][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:35:51,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:52,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:52,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:53,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:54,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:55,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:56,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:56,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:57,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:58,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:35:59,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:00,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:00,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:01,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:02,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:03,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:04,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:04,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:05,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:06,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:07,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:08,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:09,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:10,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:10,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:11,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:12,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:13,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:14,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:14,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:15,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:16,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:36:18,049][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:36:18,968][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:36:18,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:36:20,221][__main__][INFO] - Iteration 90 took 54s (37.71% Gen, 62.29% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 48m 20s. Estimated total time: 15h 14m 9s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 4s.
[2025-08-20 09:36:20,222][__main__][INFO] - Starting iteration 90.
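The Gen/Train percentages in each summary are just the two phase durations normalized by their sum. The printed seconds are rounded, which is why recomputing iteration 90's split from "20s" and "34s" gives ~37.0% rather than the logged 37.71%:

    def phase_split(gen_seconds, train_seconds):
        total = gen_seconds + train_seconds
        return 100 * gen_seconds / total, 100 * train_seconds / total

    print(phase_split(20, 34))  # (37.03..., 62.96...) vs. the logged 37.71/62.29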
[2025-08-20 09:36:43,219][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:36:43,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:36:43,226][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:36:45,688][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:36:45,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:36:45,696][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:36:45,698][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:36:45,698][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:36:45,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." entries omitted (one roughly every 0.8 s) ...]
[2025-08-20 09:37:11,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:37:12,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:37:13,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:37:13,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:37:15,097][__main__][INFO] - Iteration 91 took 54s (37.45% Gen, 62.55% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 47m 50s. Estimated total time: 15h 14m 34s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 17s.
[2025-08-20 09:37:15,098][__main__][INFO] - Starting iteration 91.
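Each iteration accumulates the policy-gradient loss over 32 micro-batches of exactly 8960 tokens (286,720 tokens per update) before the single "Apply reinforce step". The ΔVRAM Allocated of 2.288818359375e-05 GB on that step is exactly 24 KiB, consistent with the optimizer state already being allocated so only trivial bookkeeping memory moves. A sketch of the accumulate-then-step pattern follows, with illustrative names and shapes; this is not the repository's API.

```python
# Sketch: accumulate per-token REINFORCE gradients over fixed-size
# micro-batches, then apply one optimizer step ("Apply reinforce step").
import torch

def reinforce_epoch(policy, optimizer, micro_batches, grad_accum_steps=32):
    optimizer.zero_grad(set_to_none=True)
    for mb in micro_batches:  # each micro-batch covers a fixed 8960-token slice
        logits = policy(mb["input_ids"]).logits                      # [B, T, vocab]
        logprobs = torch.log_softmax(logits, dim=-1)
        token_lp = logprobs.gather(-1, mb["target_ids"].unsqueeze(-1)).squeeze(-1)
        # Per-token REINFORCE objective, masked to the agent's own generated
        # tokens; dividing by the accumulation count makes the final step a mean.
        loss = -(mb["advantages"] * token_lp * mb["mask"]).sum()
        loss = loss / (mb["mask"].sum() * grad_accum_steps)
        loss.backward()                                              # gradients accumulate
    optimizer.step()                                                 # the logged reinforce step
```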
[2025-08-20 09:37:38,768][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:37:38,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:37:38,776][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:37:41,222][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:37:41,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:37:41,229][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:37:41,232][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:37:41,232][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:37:41,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." entries omitted (one roughly every 0.8 s) ...]
[2025-08-20 09:38:07,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:38:09,169][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:38:10,110][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:38:10,112][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:38:11,388][__main__][INFO] - Iteration 92 took 56s (37.72% Gen, 62.28% Train). Generation: 21s, Training: 35s. Estimated remaining time: 14h 10m 28s. Estimated total time: 15h 38m 8s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 48s, 500 more iterations: 7h 49m 4s.
[2025-08-20 09:38:11,389][__main__][INFO] - Starting iteration 92.
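The two "Saved ... optimizer state" entries each iteration checkpoint the policy and critic optimizers separately under the trainer directory shown in the paths. A minimal sketch: the filenames are taken from the log, while the function name and signature are hypothetical.

```python
# Sketch of the per-iteration optimizer checkpointing visible in the log.
import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, trainer_dir):
    policy_path = f"{trainer_dir}/policy_optimizer_state.pt"
    critic_path = f"{trainer_dir}/critic_optimizer_state.pt"
    torch.save(policy_optimizer.state_dict(), policy_path)  # "Saved main optimizer state"
    torch.save(critic_optimizer.state_dict(), critic_path)  # "Saved critic optimizer state"
    return policy_path, critic_path
```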
[2025-08-20 09:38:34,503][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:38:34,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:38:34,511][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:38:36,959][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:38:36,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:38:36,966][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:38:36,969][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:38:36,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:38:37,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." entries omitted (one roughly every 0.8 s) ...]
[2025-08-20 09:39:02,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:39:03,934][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:39:04,882][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:39:04,883][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:39:06,185][__main__][INFO] - Iteration 93 took 54s (37.76% Gen, 62.24% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 44m 40s. Estimated total time: 15h 13m 15s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 19s, 500 more iterations: 7h 36m 37s.
[2025-08-20 09:39:06,187][__main__][INFO] - Starting iteration 93.
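The timing summaries are internally consistent with a single running mean of iteration wall time: here 1h 31m 19s for 100 more iterations works out to 5479 s / 100 ≈ 54.8 s per iteration, which also reproduces the 10-iteration (≈9m 8s) and 500-iteration (≈7h 36m) projections. A sketch of that bookkeeping, with illustrative names:

```python
# Sketch: report the generation/training split for one iteration and project
# remaining time from the mean iteration duration.
def iteration_summary(gen_s: float, train_s: float,
                      mean_iter_s: float, iters_left: int) -> str:
    total = gen_s + train_s
    gen_pct = 100.0 * gen_s / total
    eta = int(mean_iter_s * iters_left)
    h, rem = divmod(eta, 3600)
    m, s = divmod(rem, 60)
    return (f"took {total:.0f}s ({gen_pct:.2f}% Gen, {100 - gen_pct:.2f}% Train). "
            f"Estimated remaining time: {h}h {m}m {s}s.")

# Sanity check against the log: 54.79 s/iter * 100 iterations = 5479 s = 1h 31m 19s.
print(iteration_summary(20, 34, 54.79, 100))
```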
[2025-08-20 09:39:29,226][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:39:29,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:39:29,233][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:39:31,677][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:39:31,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:39:31,685][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:39:31,687][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:39:31,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:39:31,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." entries omitted (one roughly every 0.8 s) ...]
[2025-08-20 09:39:57,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:39:58,730][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:39:59,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:39:59,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:40:00,999][__main__][INFO] - Iteration 94 took 54s (37.58% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 44m 1s. Estimated total time: 15h 13m 31s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 21s, 500 more iterations: 7h 36m 45s.
[2025-08-20 09:40:01,000][__main__][INFO] - Starting iteration 94.
[2025-08-20 09:40:24,120][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:40:24,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:40:24,128][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:40:26,604][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:40:26,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:40:26,611][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:40:26,614][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:40:26,614][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:40:26,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." entries omitted (one roughly every 0.8 s) ...]
[2025-08-20 09:40:52,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:40:53,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:40:54,636][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:40:54,638][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:40:55,920][__main__][INFO] - Iteration 95 took 54s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 44m 54s. Estimated total time: 15h 15m 19s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 39s.
[2025-08-20 09:40:55,922][__main__][INFO] - Starting iteration 95.
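The "Sharing advantage alignment data." / "Receiving advantage packets." pair marks the point where this trainer exchanges advantage estimates with its opponent's trainer so the advantage-alignment objective can couple the two agents' updates. The transport and packet format are not visible in the log; the in-process queue below is purely illustrative.

```python
# An assumed exchange channel for advantage packets between two agents'
# trainers; the real mechanism is not shown in the log.
import queue
from dataclasses import dataclass

import torch

@dataclass
class AdvantagePacket:
    agent_name: str
    advantages: torch.Tensor  # per-step advantage estimates for one batch

class AdvantageExchange:
    def __init__(self):
        self.inbox: "queue.Queue[AdvantagePacket]" = queue.Queue()

    def share(self, peer: "AdvantageExchange", packet: AdvantagePacket) -> None:
        # "Sharing advantage alignment data."
        peer.inbox.put(packet)

    def receive(self) -> AdvantagePacket:
        # "Receiving advantage packets."
        return self.inbox.get()
```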
[2025-08-20 09:41:18,957][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:41:18,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:41:18,964][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:41:21,410][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:41:21,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:41:21,418][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:41:21,421][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:41:21,421][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:41:21,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." entries omitted (one roughly every 0.8 s) ...]
[2025-08-20 09:41:46,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:41:48,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:41:49,296][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:41:49,297][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:41:50,664][__main__][INFO] - Iteration 96 took 54s (37.66% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 41m 2s. Estimated total time: 15h 12m 21s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 14s, 500 more iterations: 7h 36m 10s.
[2025-08-20 09:41:50,665][__main__][INFO] - Starting iteration 96.
[2025-08-20 09:42:13,736][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:42:13,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:42:13,744][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:42:16,203][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:42:16,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:42:16,211][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:42:16,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:42:16,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:42:16,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." entries omitted (one roughly every 0.8 s) ...]
[2025-08-20 09:42:41,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:42:43,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:42:44,186][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:42:44,189][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:42:45,443][__main__][INFO] - Iteration 97 took 54s (37.65% Gen, 62.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 40m 43s. Estimated total time: 15h 12m 57s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 17s, 500 more iterations: 7h 36m 28s.
[2025-08-20 09:42:45,445][__main__][INFO] - Starting iteration 97.
[2025-08-20 09:43:08,995][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:43:08,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:43:09,003][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:43:11,467][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:43:11,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:43:11,474][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:43:11,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:43:11,477][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:43:11,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." entries omitted (one roughly every 0.8 s) ...]
[2025-08-20 09:43:36,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:43:38,407][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:43:39,342][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:43:39,343][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:43:41,915][__main__][INFO] - Iteration 98 took 56s (37.36% Gen, 62.64% Train). Generation: 21s, Training: 35s. Estimated remaining time: 14h 7m 58s. Estimated total time: 15h 41m 9s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 6s, 500 more iterations: 7h 50m 34s.
[2025-08-20 09:43:41,917][__main__][INFO] - Starting iteration 98.
[2025-08-20 09:44:05,019][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:44:05,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:44:05,027][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:44:07,474][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:44:07,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:44:07,482][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:44:07,484][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:44:07,485][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:44:07,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." entries omitted (one roughly every 0.8 s) ...]
[2025-08-20 09:44:32,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:44:34,382][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:44:35,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:44:35,324][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:44:36,675][__main__][INFO] - Iteration 99 took 54s (37.74% Gen, 62.26% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 38m 32s. Estimated total time: 15h 12m 38s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 19s.
[2025-08-20 09:44:36,677][__main__][INFO] - Starting iteration 99.
[2025-08-20 09:45:04,028][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:45:04,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:45:04,036][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:45:06,513][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:45:06,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:45:06,521][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:45:06,524][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:45:06,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:45:06,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:07,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:08,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:09,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:09,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:10,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:11,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:12,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:13,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:13,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:14,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:15,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:16,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:17,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:17,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:18,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:19,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:20,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:21,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:21,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:22,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:23,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:24,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:25,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:26,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:27,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:27,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:28,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:29,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:30,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:31,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:31,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:45:33,474][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 09:45:34,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:45:34,431][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:45:35,776][__main__][INFO] - Iteration 100 took 59s (42.10% Gen, 57.90% Train). Generation: 24s, Training: 34s. Estimated remaining time: 14h 49m 53s. Estimated total time: 16h 24m 58s. Time estimates for 10 more iterations: 9m 50s, 100 more iterations: 1h 38m 29s, 500 more iterations: 8h 12m 29s.
[2025-08-20 09:45:35,777][__main__][INFO] - Starting iteration 100.
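
Each training phase above accumulates the policy-gradient loss over 32 micro-batches of 8,960 tokens (about 287k tokens in total) before the single "Apply reinforce step". The following is a minimal sketch of that gradient-accumulation pattern under a standard REINFORCE-style loss; policy.log_probs, batch.advantages, and batch.num_tokens are hypothetical names, not the trainer_common API.

import torch

def reinforce_iteration(policy, optimizer, micro_batches, logger):
    optimizer.zero_grad(set_to_none=True)
    n = len(micro_batches)  # 32 micro-batches per step in the run logged here
    for batch in micro_batches:
        # Log-probabilities of the sampled tokens under the current policy
        # (hypothetical accessor).
        logp = policy.log_probs(batch.tokens)
        # Advantage-weighted policy-gradient loss, averaged over tokens.
        loss = -(batch.advantages * logp).sum() / batch.num_tokens
        # Scaling by 1/n makes the accumulated gradient equal the gradient
        # of the mean loss over the full batch.
        (loss / n).backward()
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    batch.num_tokens)
    optimizer.step()  # corresponds to the "Apply reinforce step" entry

Accumulating gradients this way keeps peak activation memory at the micro-batch scale, which is consistent with the near-zero ΔVRAM the step reports above.
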
[2025-08-20 09:45:58,929][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:45:58,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:45:58,936][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:46:01,386][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:46:01,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:46:01,393][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:46:01,396][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:46:01,396][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:46:01,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:02,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:03,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:04,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:04,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:05,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:06,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:07,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:08,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:08,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:09,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:10,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:11,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:12,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:12,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:13,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:14,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:15,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:15,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:16,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:17,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:18,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:19,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:20,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:21,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:22,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:22,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:23,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:24,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:25,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:25,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:26,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:46:28,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 09:46:29,302][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:46:29,304][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:46:33,802][__main__][INFO] - Iteration 101 took 58s (35.67% Gen, 58.96% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14h 31m 1s. Estimated total time: 16h 7m 4s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 42s, 500 more iterations: 8h 3m 32s.
[2025-08-20 09:46:33,804][__main__][INFO] - Starting iteration 101.
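
After each reinforce step the trainer persists both optimizer states, which is what lets a preempted run resume without losing the optimizers' accumulated statistics. In PyTorch this is typically a torch.save of each optimizer's state_dict; a sketch with a hypothetical helper name, using the run directory shown in the log:

import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir: str):
    # Persisting state_dict() keeps per-parameter statistics (e.g. momentum
    # buffers) so a restarted run continues where it left off.
    torch.save(policy_optimizer.state_dict(),
               f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_optimizer.state_dict(),
               f"{out_dir}/critic_optimizer_state.pt")
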
[2025-08-20 09:46:57,387][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:46:59,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:46:59,775][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:02,252][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:02,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:02,259][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:02,261][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:47:02,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:47:02,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:03,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:04,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:04,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:05,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:06,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:07,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:08,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:08,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:09,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:10,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:11,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:12,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:12,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:13,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:14,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:15,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:16,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:16,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:17,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:18,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:19,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:20,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:21,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:22,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:22,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:23,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:24,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:25,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:26,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:26,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:27,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:29,213][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 09:47:30,153][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:47:30,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:47:31,475][__main__][INFO] - Iteration 102 took 57s (36.60% Gen, 63.39% Train). Generation: 21s, Training: 36s. Estimated remaining time: 14h 24m 9s. Estimated total time: 16h 1m 10s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 7s, 500 more iterations: 8h 0m 35s.
[2025-08-20 09:47:31,476][__main__][INFO] - Starting iteration 102.
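
The per-iteration summaries appear to extrapolate linearly from the latest iteration's unrounded duration: iteration 102 above took roughly 57.6 s, and 10 × 57.6 s ≈ 9m 36s, matching the logged estimate. A sketch of that arithmetic with hypothetical helper names (the actual estimator may smooth over several iterations):

def format_hms(seconds: float) -> str:
    s = int(round(seconds))
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

def eta_summary(iteration_seconds: float, iterations_left: int) -> str:
    # Linear extrapolation: assumes future iterations cost as much as the
    # last one, which holds well here (iterations hover around 54-59 s).
    parts = [
        f"Estimated remaining time: "
        f"{format_hms(iteration_seconds * iterations_left)}."
    ]
    for n in (10, 100, 500):
        parts.append(f"{n} more iterations: {format_hms(iteration_seconds * n)}")
    return " ".join(parts)
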
[2025-08-20 09:47:54,593][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:54,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:54,600][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:57,074][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:57,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:57,082][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:47:57,084][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:47:57,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:47:57,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:58,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:58,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:47:59,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:00,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:01,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:02,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:02,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:03,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:04,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:05,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:06,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:06,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:07,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:08,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:09,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:10,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:10,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:11,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:12,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:13,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:14,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:14,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:15,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:16,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:17,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:18,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:19,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:20,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:20,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:21,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:22,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:23,988][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 09:48:28,598][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:48:28,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:48:29,956][__main__][INFO] - Iteration 103 took 58s (35.32% Gen, 64.68% Train). Generation: 20s, Training: 37s. Estimated remaining time: 14h 36m 40s. Estimated total time: 16h 14m 39s. Time estimates for 10 more iterations: 9m 44s, 100 more iterations: 1h 37m 27s, 500 more iterations: 8h 7m 19s.
[2025-08-20 09:48:29,957][__main__][INFO] - Starting iteration 103.
[2025-08-20 09:48:53,173][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:48:53,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:48:53,181][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:48:55,663][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:48:55,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:48:55,671][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:48:55,673][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:48:55,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:48:55,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:56,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:57,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:58,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:59,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:48:59,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:00,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:01,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:02,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:03,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:03,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:04,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:05,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:06,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:07,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:07,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:08,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:09,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:10,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:11,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:11,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:12,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:13,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:14,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:15,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:16,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:17,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:17,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:18,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:19,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:20,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:21,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:22,645][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 09:49:23,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:49:23,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:49:24,839][__main__][INFO] - Iteration 104 took 54s (37.80% Gen, 62.20% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 35m 47s. Estimated total time: 15h 14m 41s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 20s.
[2025-08-20 09:49:24,841][__main__][INFO] - Starting iteration 104.
[2025-08-20 09:49:48,099][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:49:48,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:49:48,108][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:49:50,571][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:49:50,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:49:50,579][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:49:50,581][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:49:50,581][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:49:50,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:51,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:52,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:53,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:54,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:54,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:55,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:56,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:57,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:58,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:58,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:49:59,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:00,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:01,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:01,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:02,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:03,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:04,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:05,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:05,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:06,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:07,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:08,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:09,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:10,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:11,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:11,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:12,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:13,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:14,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:15,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:15,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:17,495][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 09:50:18,424][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:50:18,426][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:50:19,702][__main__][INFO] - Iteration 105 took 54s (37.93% Gen, 62.07% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 34m 32s. Estimated total time: 15h 14m 21s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 26s, 500 more iterations: 7h 37m 10s.
[2025-08-20 09:50:19,704][__main__][INFO] - Starting iteration 105.
[2025-08-20 09:50:42,899][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:50:42,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:50:42,907][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:50:45,376][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:50:45,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:50:45,384][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:50:45,387][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:50:45,387][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:50:45,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:46,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:47,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:48,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:48,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:49,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:50,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:51,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:52,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:52,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:53,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:54,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:55,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:56,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:56,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:57,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:58,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:59,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:50:59,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:00,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:02,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:02,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:03,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:04,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:05,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:05,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:06,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:07,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:08,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:09,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:09,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:10,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:12,310][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 09:51:13,209][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:51:13,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:51:14,523][__main__][INFO] - Iteration 106 took 54s (37.82% Gen, 62.17% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 32m 55s. Estimated total time: 15h 13m 38s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 21s, 500 more iterations: 7h 36m 49s.
[2025-08-20 09:51:14,524][__main__][INFO] - Starting iteration 106.
[2025-08-20 09:51:37,995][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:51:37,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:51:38,003][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:51:40,472][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:51:40,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:51:40,481][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:51:40,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:51:40,483][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:51:40,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:41,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:42,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:43,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:43,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:44,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:45,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:46,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:47,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:47,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:48,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:49,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:50,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:51,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:51,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:52,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:53,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:54,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:55,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:56,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:57,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:57,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:58,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:51:59,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:00,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:01,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:01,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:02,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:03,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:04,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:05,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:05,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:07,452][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 09:52:08,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:52:08,377][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:52:09,732][__main__][INFO] - Iteration 107 took 55s (38.04% Gen, 61.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 38m 28s. Estimated total time: 15h 20m 6s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 3s.
[2025-08-20 09:52:09,733][__main__][INFO] - Starting iteration 107.
[2025-08-20 09:52:33,389][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:52:33,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:52:33,396][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:52:35,856][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:52:35,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:52:35,864][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 09:52:35,866][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:52:35,867][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:52:36,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:36,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:37,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:38,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:39,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:40,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:40,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:41,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:42,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:43,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:44,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:44,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:45,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:46,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:47,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:48,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:48,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:49,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:50,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:51,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:52,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:52,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:53,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:54,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:55,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:56,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:57,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:58,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:58,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:52:59,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:00,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:01,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:02,773][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 09:53:03,688][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:53:03,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:53:06,040][__main__][INFO] - Iteration 108 took 56s (37.64% Gen, 62.36% Train). Generation: 21s, Training: 35s. Estimated remaining time: 13h 55m 51s. Estimated total time: 15h 38m 26s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 13s.
[2025-08-20 09:53:06,042][__main__][INFO] - Starting iteration 108.
[2025-08-20 09:53:29,380][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:53:29,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:53:29,387][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:53:31,844][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:53:31,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:53:31,852][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:53:31,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:53:31,855][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:53:32,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:32,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:33,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:34,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:35,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:36,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:36,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:37,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:38,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:39,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:40,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:40,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:41,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:42,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:43,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:44,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:44,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:45,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:46,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:47,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:48,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:48,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:49,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:50,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:51,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:52,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:53,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:54,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:54,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:55,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:56,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:57,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:53:58,887][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:53:59,828][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:53:59,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:54:01,105][__main__][INFO] - Iteration 109 took 55s (37.94% Gen, 62.06% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 34m 12s. Estimated total time: 15h 17m 42s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 51s.
[2025-08-20 09:54:01,106][__main__][INFO] - Starting iteration 109.
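
Each iteration opens with the same three trainer_ad_align tasks, run twice: build an alternative trajectory batch, estimate advantages while accumulating the critic's regression gradient, and compute advantage estimates on the alternative batch. The log does not show the estimator itself; below is a minimal sketch of the middle step under the assumption of a one-step temporal-difference advantage, with a hypothetical critic interface and batch layout.

    import torch
    import torch.nn.functional as F

    def advantages_with_critic_accumulation(critic, batch, gamma=0.99):
        values = critic(batch["obs"])              # [B, T] per-turn state values
        rewards = batch["rewards"]                 # [B, T] per-turn rewards
        # Bootstrap targets r_t + gamma * V(s_{t+1}), with zero after the last turn.
        next_values = torch.cat(
            [values[:, 1:], torch.zeros_like(values[:, :1])], dim=1
        )
        targets = (rewards + gamma * next_values).detach()
        advantages = (targets - values).detach()   # A_t = target_t - V(s_t)
        # Accumulate the critic's value-regression gradient now; the optimizer
        # step is applied later, alongside the policy's reinforce step.
        F.mse_loss(values, targets).backward()
        return advantages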
[2025-08-20 09:54:24,271][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:54:24,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:54:24,279][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:54:26,714][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:54:26,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:54:26,722][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:54:26,724][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:54:26,725][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:54:27,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:27,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:28,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:29,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:30,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:30,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:31,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:32,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:34,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:35,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:36,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:37,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:37,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:38,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:39,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:40,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:41,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:41,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:42,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:43,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:44,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:45,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:46,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:47,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:48,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:49,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:49,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:50,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:52,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:56,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:56,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:57,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:54:59,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:32, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:55:00,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:55:00,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:55:01,603][__main__][INFO] - Iteration 110 took 1m 0s (34.23% Gen, 65.77% Train). Generation: 20s, Training: 39s. Estimated remaining time: 15h 3m 45s. Estimated total time: 16h 48m 16s. Time estimates for 10 more iterations: 10m 4s, 100 more iterations: 1h 40m 49s, 500 more iterations: 8h 24m 8s.
[2025-08-20 09:55:01,604][__main__][INFO] - Starting iteration 110.
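
"Sharing advantage alignment data" and "Receiving advantage packets" indicate that the two trainers swap advantage estimates before the policy update, which advantage alignment needs because each agent's surrogate is weighted by its opponent's advantages as well as its own. The sketch below illustrates such an exchange with torch.distributed point-to-point ops; the packet fields and the transport are assumptions, not mllm's actual protocol.

    from dataclasses import dataclass

    import torch
    import torch.distributed as dist

    @dataclass
    class AdvantagePacket:
        # Hypothetical container; field names are illustrative only.
        advantages: torch.Tensor       # [batch, turns] advantage estimates
        alt_advantages: torch.Tensor   # estimates under the alternative batch

    def share_advantages(packet: AdvantagePacket, peer_rank: int) -> AdvantagePacket:
        # Assumes dist.init_process_group() was called at startup and that both
        # trainers call this symmetrically with matching tensor shapes.
        recv_adv = torch.empty_like(packet.advantages)
        recv_alt = torch.empty_like(packet.alt_advantages)
        reqs = [
            dist.isend(packet.advantages, dst=peer_rank),
            dist.isend(packet.alt_advantages, dst=peer_rank),
            dist.irecv(recv_adv, src=peer_rank),
            dist.irecv(recv_alt, src=peer_rank),
        ]
        for r in reqs:
            r.wait()
        return AdvantagePacket(advantages=recv_adv, alt_advantages=recv_alt)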
[2025-08-20 09:55:24,853][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:55:24,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:55:24,860][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:55:27,321][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:55:27,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:55:27,329][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:55:27,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:55:27,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:55:27,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:28,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:29,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:30,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:30,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:31,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:32,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:33,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:33,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:34,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:35,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:36,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:37,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:37,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:38,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:39,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:40,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:41,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:42,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:43,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:43,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:44,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:45,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:46,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:47,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:47,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:48,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:49,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:50,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:51,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:51,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:52,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:55:54,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:55:55,218][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:55:55,220][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:55:56,586][__main__][INFO] - Iteration 111 took 54s (37.81% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 30m 55s. Estimated total time: 15h 16m 21s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 10s.
[2025-08-20 09:55:56,587][__main__][INFO] - Starting iteration 111.
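
The 32 "Accumulated the policy gradient loss for 8960 tokens." entries per iteration show gradient accumulation over fixed-size token microbatches: 32 × 8,960 ≈ 287k tokens contribute to each update before the single "Apply reinforce step". A minimal sketch of that loop, assuming a REINFORCE surrogate over action tokens; log_probs and the microbatch fields are hypothetical names, not mllm's actual interface.

    def accumulate_and_step(policy, optimizer, microbatches, logger):
        optimizer.zero_grad(set_to_none=True)
        n = len(microbatches)
        for mb in microbatches:
            logp = policy.log_probs(mb["input_ids"])  # hypothetical: [B, T] token log-probs
            mask = mb["loss_mask"]                    # 1.0 on generated (action) tokens
            # REINFORCE surrogate: -E[A_t * log pi(a_t | s_t)] over action tokens.
            loss = -(mb["advantages"] * logp * mask).sum() / mask.sum()
            (loss / n).backward()                     # scale so gradients average
            logger.info("Accumulated the policy gradient loss for %d tokens.",
                        int(mask.sum().item()))
        optimizer.step()                              # the "Apply reinforce step"

Scaling each microbatch loss by 1/n makes the accumulated gradient equal to the gradient of the mean loss over the full batch, which keeps the effective learning rate independent of the accumulation depth.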
[2025-08-20 09:56:20,558][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:56:20,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:56:20,565][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:56:23,011][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:56:23,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:56:23,019][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:56:23,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:56:23,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:56:23,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:24,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:24,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:25,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:26,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:27,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:28,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:28,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:29,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:30,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:31,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:32,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:32,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:33,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:34,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:35,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:36,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:36,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:37,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:38,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:39,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:40,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:41,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:42,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:42,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:43,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:44,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:45,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:46,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:46,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:47,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:48,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:56:49,983][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:56:50,939][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:56:50,940][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:56:52,227][__main__][INFO] - Iteration 112 took 55s (38.69% Gen, 61.30% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 40m 58s. Estimated total time: 15h 27m 19s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 39s.
[2025-08-20 09:56:52,229][__main__][INFO] - Starting iteration 112.
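
After every step, both optimizer states are written to the agent_trainer directory, so a preempted run can resume with its Adam/momentum statistics intact; the adapter weights themselves are presumably checkpointed elsewhere. A sketch of the save path, which is likely little more than torch.save on each optimizer's state_dict (the function name is hypothetical; the filenames mirror the log):

    import os

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
        os.makedirs(out_dir, exist_ok=True)
        policy_path = os.path.join(out_dir, "policy_optimizer_state.pt")
        critic_path = os.path.join(out_dir, "critic_optimizer_state.pt")
        # state_dict() captures step counts and first/second moment estimates.
        torch.save(policy_opt.state_dict(), policy_path)
        logger.info("Saved main optimizer state to %s", policy_path)
        torch.save(critic_opt.state_dict(), critic_path)
        logger.info("Saved critic optimizer state to %s", critic_path)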
[2025-08-20 09:57:15,566][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:57:15,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:57:15,574][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:57:18,033][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:57:18,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:57:18,041][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:57:18,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:57:18,044][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:57:18,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:19,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:19,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:20,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:21,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:22,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:23,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:23,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:24,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:25,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:26,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:27,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:27,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:28,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:29,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:30,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:31,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:31,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:32,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:33,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:34,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:35,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:36,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:37,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:37,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:38,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:39,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:40,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:40,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:41,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:42,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:43,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:57:44,960][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:57:45,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:57:45,911][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:57:47,785][__main__][INFO] - Iteration 113 took 55s (37.58% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 38m 39s. Estimated total time: 15h 25m 55s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 35s, 500 more iterations: 7h 42m 57s.
[2025-08-20 09:57:47,786][__main__][INFO] - Starting iteration 113.
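
The per-iteration summary splits wall-clock time between generation and training and extrapolates linearly: at roughly 55 s per iteration, the "10 / 100 / 500 more iterations" figures are just multiples of the average iteration time, and the remaining-time and total-time estimates are consistent with a run of roughly 1,000 iterations. A sketch of the arithmetic (using the current iteration's duration rather than whatever running average the trainer actually keeps):

    def fmt(seconds: float) -> str:
        # Render durations the way the log does: "13h 34m 12s", "9m 10s", "55s".
        s = int(round(seconds))
        h, rem = divmod(s, 3600)
        m, s = divmod(rem, 60)
        if h:
            return f"{h}h {m}m {s}s"
        return f"{m}m {s}s" if m else f"{s}s"

    def iteration_summary(i, gen_s, train_s, done, total=1000):
        # total=1000 is inferred from the log's totals (~15h 15m at ~55 s/iter).
        it = gen_s + train_s
        remaining = (total - done) * it
        return (
            f"Iteration {i} took {fmt(it)} "
            f"({100 * gen_s / it:.2f}% Gen, {100 * train_s / it:.2f}% Train). "
            f"Generation: {fmt(gen_s)}, Training: {fmt(train_s)}. "
            f"Estimated remaining time: {fmt(remaining)}. "
            f"Time estimates for 10 more iterations: {fmt(10 * it)}, "
            f"100 more iterations: {fmt(100 * it)}."
        )

For example, 55 s per iteration gives fmt(10 * 55) == "9m 10s" and fmt(100 * 55.06) == "1h 31m 46s", matching the figures reported for iteration 109 above.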
[2025-08-20 09:58:11,023][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:58:11,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:58:11,031][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:58:13,499][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:58:13,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:58:13,507][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:58:13,509][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:58:13,510][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:58:13,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:14,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:15,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:16,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:16,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:17,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:18,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:19,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:20,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:20,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:21,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:22,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:23,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:24,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:24,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:25,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:26,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:27,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:28,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:29,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:30,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:31,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:31,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:32,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:33,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:34,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:34,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:35,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:36,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:37,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:38,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:38,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:58:40,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:58:41,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:58:41,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:58:42,784][__main__][INFO] - Iteration 114 took 54s (37.80% Gen, 62.20% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 28m 24s. Estimated total time: 15h 16m 36s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 39s, 500 more iterations: 7h 38m 18s.
[2025-08-20 09:58:42,785][__main__][INFO] - Starting iteration 114.
[2025-08-20 09:59:11,669][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:59:11,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:59:11,677][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:59:14,109][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:59:14,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:59:14,117][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 09:59:14,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 09:59:14,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 09:59:14,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:15,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:15,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:16,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:17,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:18,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:19,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:19,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:20,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:21,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:22,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:23,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:23,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:24,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:25,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:26,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:27,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:27,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:29,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:29,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:30,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:31,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:32,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:33,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:33,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:34,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:35,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:36,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:37,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:37,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:38,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:39,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 09:59:40,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 09:59:41,938][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 09:59:41,939][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 09:59:43,283][__main__][INFO] - Iteration 115 took 1m 0s (43.70% Gen, 56.30% Train). Generation: 26s, Training: 34s. Estimated remaining time: 14h 59m 5s. Estimated total time: 16h 48m 17s. Time estimates for 10 more iterations: 10m 4s, 100 more iterations: 1h 40m 49s, 500 more iterations: 8h 24m 8s.
[2025-08-20 09:59:43,285][__main__][INFO] - Starting iteration 115.
[2025-08-20 10:00:06,571][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:00:06,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:00:06,578][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:00:09,024][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:00:09,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:00:09,032][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:00:09,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:00:09,035][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:00:09,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:10,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:10,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:11,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:12,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:13,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:14,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:14,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:15,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:16,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:17,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:18,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:18,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:19,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:20,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:21,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:22,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:22,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:23,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:24,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:25,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:26,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:26,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:27,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:28,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:29,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:29,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:31,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:32,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:32,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:33,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:34,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:00:36,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:00:37,025][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:00:37,027][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:00:38,285][__main__][INFO] - Iteration 116 took 54s (37.89% Gen, 62.11% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 26m 32s. Estimated total time: 15h 16m 39s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 39s, 500 more iterations: 7h 38m 19s.
[2025-08-20 10:00:42,985][__main__][INFO] - Starting iteration 116.
[2025-08-20 10:01:06,309][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:01:06,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:01:06,316][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:01:08,767][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:01:08,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:01:08,774][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:01:08,776][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:01:08,777][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:01:09,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:09,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:10,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:11,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:12,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:13,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:13,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:14,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:15,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:16,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:16,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:17,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:18,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:19,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:20,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:20,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:21,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:22,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:23,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:24,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:24,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:25,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:26,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:27,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:28,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:29,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:30,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:31,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:32,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:32,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:33,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:34,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:01:36,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:01:37,081][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:01:37,082][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:01:38,411][__main__][INFO] - Iteration 117 took 55s (37.67% Gen, 62.33% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 32m 27s. Estimated total time: 15h 23m 34s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 47s.
[2025-08-20 10:01:38,413][__main__][INFO] - Starting iteration 117.
[2025-08-20 10:02:01,948][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:02:01,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:02:01,955][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[... the same three trainer_ad_align task records repeat at 10:02:04,401 / 10:02:04,402 / 10:02:04,409 with identical ΔVRAM and ΔTime values ...]
[2025-08-20 10:02:04,411][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:02:04,412][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:02:04,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 31 further identical "Accumulated the policy gradient loss for 8960 tokens." records, through 10:02:29,756 ...]
[2025-08-20 10:02:31,371][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:02:32,357][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:02:32,358][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:02:33,699][__main__][INFO] - Iteration 118 took 55s (38.17% Gen, 61.83% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 29m 23s. Estimated total time: 15h 21m 26s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 8s, 500 more iterations: 7h 40m 43s.
[2025-08-20 10:02:33,700][__main__][INFO] - Starting iteration 118.
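Each iteration shows exactly 32 `Accumulated the policy gradient loss for 8960 tokens.` records followed by a single `Apply reinforce step` record: the policy loss is backpropagated per fixed-size token microbatch and the optimizer steps once on the summed gradients. A sketch of that pattern, assuming each microbatch carries per-token log-probabilities (attached to the policy's graph) and detached advantages; all names here are hypothetical.

```python
import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def reinforce_step(policy, optimizer, microbatches, max_grad_norm: float = 1.0):
    """REINFORCE update with gradient accumulation over token microbatches.

    `microbatches` is a list of dicts with `logprobs` (per-token log pi(a|s),
    requires_grad) and `advantages` (detached), each covering a fixed token
    budget (8960 tokens per chunk in the run logged above).
    """
    optimizer.zero_grad(set_to_none=True)
    for mb in microbatches:
        # Surrogate loss -E[A_t * log pi(a_t|s_t)]; dividing by the number of
        # chunks makes the accumulated gradient equal the full-batch mean.
        loss = -(mb["advantages"] * mb["logprobs"]).mean() / len(microbatches)
        loss.backward()
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    mb["logprobs"].numel())
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()  # the "Apply reinforce step" record brackets this call
```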
[2025-08-20 10:03:00,949][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:03:00,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:03:00,957][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[... the same three trainer_ad_align task records repeat at 10:03:03,408 / 10:03:03,409 / 10:03:03,416 with identical ΔVRAM and ΔTime values ...]
[2025-08-20 10:03:03,418][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:03:03,419][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:03:03,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 31 further identical "Accumulated the policy gradient loss for 8960 tokens." records, through 10:03:28,755 ...]
[2025-08-20 10:03:30,360][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:03:31,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:03:31,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:03:32,663][__main__][INFO] - Iteration 119 took 58s (42.07% Gen, 57.92% Train). Generation: 24s, Training: 34s. Estimated remaining time: 14h 29m 40s. Estimated total time: 16h 22m 42s. Time estimates for 10 more iterations: 9m 49s, 100 more iterations: 1h 38m 16s, 500 more iterations: 8h 11m 21s.
[2025-08-20 10:03:32,672][__main__][INFO] - Starting iteration 119.
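The `Get advantages with critic gradient accumulation` task computes advantages and, in the same pass, accumulates the critic's regression loss so a later critic-optimizer step can apply it. The log does not reveal the exact estimator, so the sketch below uses a plain one-step TD advantage; the function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def td_advantages(critic, states, next_states, rewards, dones, gamma=0.99):
    """One-step TD advantage A_t = r_t + gamma*(1-done_t)*V(s_{t+1}) - V(s_t),
    accumulating the critic's value-regression gradient as a side effect."""
    values = critic(states).squeeze(-1)          # V(s_t), grads enabled
    with torch.no_grad():                        # bootstrapped target
        targets = rewards + gamma * (1.0 - dones) * critic(next_states).squeeze(-1)
    F.mse_loss(values, targets).backward()       # accumulate critic grads only
    return (targets - values).detach()           # advantages for the policy loss
```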
[2025-08-20 10:04:05,261][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:04:05,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:04:05,269][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[... the same three trainer_ad_align task records repeat at 10:04:07,731 / 10:04:07,733 / 10:04:07,739 with identical ΔVRAM and ΔTime values ...]
[2025-08-20 10:04:07,741][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:04:07,742][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:04:08,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 31 further identical "Accumulated the policy gradient loss for 8960 tokens." records, through 10:04:33,068 ...]
[2025-08-20 10:04:34,661][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:04:35,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:04:35,626][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:04:36,967][__main__][INFO] - Iteration 120 took 1m 4s (46.87% Gen, 53.13% Train). Generation: 30s, Training: 34s. Estimated remaining time: 15h 57m 29s. Estimated total time: 17h 51m 35s. Time estimates for 10 more iterations: 10m 42s, 100 more iterations: 1h 47m 9s, 500 more iterations: 8h 55m 47s.
[2025-08-20 10:04:36,969][__main__][INFO] - Starting iteration 120.
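The two `Saved … optimizer state` records persist the policy and critic optimizer states every iteration, so a preempted job can resume without resetting Adam's moment estimates. A sketch of the save/restore pair; the file names follow the paths in the log, but the function names are invented.

```python
import os

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    # Written every iteration, so the newest step is always recoverable.
    torch.save(policy_opt.state_dict(), os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(), os.path.join(out_dir, "critic_optimizer_state.pt"))

def load_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    # Call after rebuilding the optimizers on restart, before the next step.
    policy_opt.load_state_dict(torch.load(os.path.join(out_dir, "policy_optimizer_state.pt")))
    critic_opt.load_state_dict(torch.load(os.path.join(out_dir, "critic_optimizer_state.pt")))
```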
[2025-08-20 10:05:00,124][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:05:00,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:05:00,131][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[... the same three trainer_ad_align task records repeat at 10:05:02,588 / 10:05:02,589 / 10:05:02,596 with identical ΔVRAM and ΔTime values ...]
[2025-08-20 10:05:02,598][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:05:02,598][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:05:02,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 31 further identical "Accumulated the policy gradient loss for 8960 tokens." records, through 10:05:33,157 ...]
[2025-08-20 10:05:34,780][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:32, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:05:35,769][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:05:35,770][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:05:37,321][__main__][INFO] - Iteration 121 took 1m 0s (34.30% Gen, 65.69% Train). Generation: 20s, Training: 39s. Estimated remaining time: 14h 50m 45s. Estimated total time: 16h 45m 51s. Time estimates for 10 more iterations: 10m 3s, 100 more iterations: 1h 40m 35s, 500 more iterations: 8h 22m 55s.
[2025-08-20 10:05:37,322][__main__][INFO] - Starting iteration 121.
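The iteration summaries extrapolate the mean iteration duration to the remaining work; the figures here (~13.5 h remaining at ~55 s per iteration after iteration 117) are consistent with a run of roughly 1000 iterations. A sketch of that bookkeeping, assuming a simple running mean over all iterations so far; the real trainer may window or smooth the average, and its time formatting differs slightly.

```python
import time

class EtaEstimator:
    """Extrapolates per-iteration durations into summary-style estimates."""

    def __init__(self, total_iterations: int):
        self.total = total_iterations
        self.durations: list[float] = []

    @staticmethod
    def _fmt(seconds: float) -> str:
        # Good enough for sub-24h horizons, as in the run logged above.
        return time.strftime("%Hh %Mm %Ss", time.gmtime(max(seconds, 0)))

    def update(self, gen_seconds: float, train_seconds: float) -> str:
        took = gen_seconds + train_seconds
        self.durations.append(took)
        avg = sum(self.durations) / len(self.durations)
        left = self.total - len(self.durations)
        return (
            f"took {took:.0f}s ({100 * gen_seconds / took:.2f}% Gen, "
            f"{100 * train_seconds / took:.2f}% Train). "
            f"Estimated remaining time: {self._fmt(avg * left)}. "
            f"Estimated total time: {self._fmt(avg * self.total)}. "
            f"Time estimates for 10 more iterations: {self._fmt(avg * 10)}, "
            f"100 more iterations: {self._fmt(avg * 100)}, "
            f"500 more iterations: {self._fmt(avg * 500)}."
        )
```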
[2025-08-20 10:06:00,400][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:06:00,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:06:00,408][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[... the same three trainer_ad_align task records repeat at 10:06:02,863 / 10:06:02,864 / 10:06:02,871 with identical ΔVRAM and ΔTime values ...]
[2025-08-20 10:06:02,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:06:02,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:06:03,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 31 further identical "Accumulated the policy gradient loss for 8960 tokens." records, through 10:06:28,428 ...]
[2025-08-20 10:06:30,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:06:31,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:06:31,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:06:32,641][__main__][INFO] - Iteration 122 took 55s (37.28% Gen, 62.71% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 25m 56s. Estimated total time: 15h 21m 58s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 11s, 500 more iterations: 7h 40m 59s.
[2025-08-20 10:06:32,643][__main__][INFO] - Starting iteration 122.
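`Sharing advantage alignment data.` / `Receiving advantage packets.` marks the point where the two agents' trainers swap advantage tensors so each can shape its update with its opponent's signal. A minimal queue-based sketch of that exchange; the repository's actual transport is not visible in the log.

```python
import torch
import torch.multiprocessing as mp

def exchange_advantages(my_adv: torch.Tensor,
                        send_q: mp.Queue, recv_q: mp.Queue) -> torch.Tensor:
    """Send our advantage packet, then block until the opponent's arrives."""
    send_q.put(my_adv.detach().cpu())   # "Sharing advantage alignment data."
    return recv_q.get()                 # "Receiving advantage packets."
```

Sending before receiving on both sides keeps the two trainers from deadlocking, which matches the ordering of the two records in every iteration above.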
[2025-08-20 10:06:56,769][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:06:56,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:06:56,777][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[... the same three trainer_ad_align task records repeat at 10:06:59,211 / 10:06:59,213 / 10:06:59,219 with identical ΔVRAM and ΔTime values ...]
[2025-08-20 10:06:59,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:06:59,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:06:59,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 31 further identical "Accumulated the policy gradient loss for 8960 tokens." records, through 10:07:24,622 ...]
[2025-08-20 10:07:26,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:07:27,182][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:07:27,184][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:07:29,127][__main__][INFO] - Iteration 123 took 56s (38.41% Gen, 61.59% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 44m 25s. Estimated total time: 15h 41m 23s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 8s, 500 more iterations: 7h 50m 41s.
[2025-08-20 10:07:29,128][__main__][INFO] - Starting iteration 123.
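The `trainer_ad_align` module and the run tag `ipd_prev_ad_align_qwen2.5_7b` point at an Advantage Alignment style objective, in which an agent's policy-gradient weight couples its own advantages with its opponent's. The sketch below shows one schematic form of that coupling only; the exact discounting, normalization, and sign conventions of the actual implementation may differ.

```python
import torch

def ad_align_weights(a_self: torch.Tensor, a_opp: torch.Tensor,
                     gamma: float = 0.96, beta: float = 1.0) -> torch.Tensor:
    """Schematic advantage-alignment weights (NOT the repo's exact estimator).

    Step t keeps its plain REINFORCE weight a_self[t] and adds an alignment
    term: (discounted sum of this agent's advantages before t) * a_opp[t].
    """
    past = torch.zeros_like(a_self)
    acc = torch.zeros(())
    for t in range(a_self.shape[0]):
        past[t] = acc                 # discounted sum of a_self[0..t-1]
        acc = gamma * acc + a_self[t]
    return a_self + beta * past * a_opp
```

These weights would then replace the plain `advantages` in the accumulation loop sketched earlier, which is consistent with the alignment exchange happening immediately before the `Accumulated the policy gradient loss` records in every iteration.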
[2025-08-20 10:07:52,219][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:07:52,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:07:52,226][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[... the same three trainer_ad_align task records repeat at 10:07:54,655 / 10:07:54,656 / 10:07:54,663 with identical ΔVRAM and ΔTime values ...]
[2025-08-20 10:07:54,665][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:07:54,666][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:07:54,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 31 further identical "Accumulated the policy gradient loss for 8960 tokens." records, through 10:08:20,104 ...]
[2025-08-20 10:08:21,748][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:08:22,810][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:08:22,812][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:08:24,138][__main__][INFO] - Iteration 124 took 55s (37.57% Gen, 62.43% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 18m 55s. Estimated total time: 15h 16m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 24s.
[2025-08-20 10:08:24,139][__main__][INFO] - Starting iteration 124.
[2025-08-20 10:08:47,236][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:08:47,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:08:47,244][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[... the same three trainer_ad_align task records repeat at 10:08:49,681 / 10:08:49,682 / 10:08:49,689 with identical ΔVRAM and ΔTime values ...]
[2025-08-20 10:08:49,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:08:49,691][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:08:49,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 31 further identical "Accumulated the policy gradient loss for 8960 tokens." records, through 10:09:15,072 ...]
[2025-08-20 10:09:16,677][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:09:17,619][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:09:17,621][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:09:18,944][__main__][INFO] - Iteration 125 took 54s (37.71% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 14m 35s. Estimated total time: 15h 13m 23s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 41s.
[2025-08-20 10:09:18,945][__main__][INFO] - Starting iteration 125.
[2025-08-20 10:09:42,463][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:09:42,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:09:42,471][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[... the same three trainer_ad_align task records repeat at 10:09:44,919 / 10:09:44,921 / 10:09:44,927 with identical ΔVRAM and ΔTime values ...]
[2025-08-20 10:09:44,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:09:44,930][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:09:45,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 31 further identical "Accumulated the policy gradient loss for 8960 tokens." records, through 10:10:10,358 ...]
[2025-08-20 10:10:11,962][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:10:12,926][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:10:12,928][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:10:14,427][__main__][INFO] - Iteration 126 took 55s (37.98% Gen, 62.02% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 24m 58s. Estimated total time: 15h 24m 41s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 28s, 500 more iterations: 7h 42m 20s.
[2025-08-20 10:10:14,429][__main__][INFO] - Starting iteration 126.
[2025-08-20 10:10:37,605][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:10:37,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:10:37,613][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:10:40,090][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:10:40,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:10:40,098][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:10:40,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:10:40,100][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:10:40,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:41,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:41,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:42,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:43,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:44,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:45,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:45,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:46,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:47,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:48,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:49,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:49,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:50,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:51,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:52,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:53,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:53,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:54,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:55,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:56,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:57,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:57,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:58,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:10:59,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:00,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:01,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:02,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:03,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:03,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:04,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:05,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:07,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:11:08,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:11:08,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:11:09,483][__main__][INFO] - Iteration 127 took 55s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 16m 55s. Estimated total time: 15h 17m 33s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 46s.
[2025-08-20 10:11:09,484][__main__][INFO] - Starting iteration 127.
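Note: each training phase logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." entries before one "Apply reinforce step", i.e. standard gradient accumulation: backprop a scaled loss per fixed-size token chunk, then step once. A sketch of that loop; the tensor names, batching, and masking scheme are assumptions, not the mllm implementation.

    import torch

    def accumulate_policy_gradient(model, micro_batches, logger):
        """REINFORCE-style accumulation over fixed-size token chunks (sketch)."""
        n = len(micro_batches)  # 32 chunks of 8960 tokens each in this run
        for mb in micro_batches:
            logits = model(mb["input_ids"]).logits
            logp = torch.log_softmax(logits[:, :-1], dim=-1)
            token_logp = logp.gather(-1, mb["input_ids"][:, 1:].unsqueeze(-1)).squeeze(-1)
            mask = mb["action_mask"][:, 1:]  # 1 on tokens the agent generated
            # policy gradient loss: -E[advantage * log pi(token)]
            loss = -(mb["advantages"][:, 1:] * token_logp * mask).sum() / mask.sum().clamp(min=1)
            (loss / n).backward()  # scale so accumulated grads equal the mean
            logger.info(
                "Accumulated the policy gradient loss for %d tokens.",
                mb["input_ids"].numel(),
            )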
[2025-08-20 10:11:32,638][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:11:32,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:11:32,646][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:11:35,089][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:11:35,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:11:35,096][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:11:35,099][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:11:35,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:11:35,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:36,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:36,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:37,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:38,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:39,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:40,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:40,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:41,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:42,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:43,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:44,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:44,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:45,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:46,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:47,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:48,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:48,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:49,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:50,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:51,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:52,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:52,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:53,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:54,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:55,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:56,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:57,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:58,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:58,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:11:59,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:00,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:02,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:12:03,029][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:12:03,031][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:12:04,538][__main__][INFO] - Iteration 128 took 55s (37.61% Gen, 62.39% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 16m 0s. Estimated total time: 15h 17m 33s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 46s.
[2025-08-20 10:12:04,540][__main__][INFO] - Starting iteration 128.
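Note: the "Apply reinforce step" task pairs the optimizer update with a checkpoint of both optimizer states, so an interrupted run can resume with Adam/momentum statistics intact. A hedged sketch of that step; gradient clipping is an assumption, since the log only shows the step and the two saves.

    import torch

    def apply_reinforce_step(model, policy_opt, critic_opt, out_dir, logger, max_norm=1.0):
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # assumed, not logged
        policy_opt.step()
        policy_opt.zero_grad(set_to_none=True)
        # persist optimizer states alongside the adapter weights
        path = f"{out_dir}/policy_optimizer_state.pt"
        torch.save(policy_opt.state_dict(), path)
        logger.info("Saved main optimizer state to %s", path)
        path = f"{out_dir}/critic_optimizer_state.pt"
        torch.save(critic_opt.state_dict(), path)
        logger.info("Saved critic optimizer state to %s", path)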
[2025-08-20 10:12:28,750][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:12:28,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:12:28,758][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:12:31,205][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:12:31,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:12:31,212][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:12:31,215][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:12:31,215][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:12:31,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:32,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:33,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:33,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:34,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:35,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:36,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:37,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:37,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:38,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:39,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:40,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:41,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:41,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:42,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:43,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:44,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:45,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:45,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:46,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:47,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:48,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:49,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:49,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:50,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:51,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:52,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:53,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:54,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:55,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:55,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:56,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:12:58,291][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:12:59,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:12:59,270][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:13:00,638][__main__][INFO] - Iteration 129 took 56s (38.80% Gen, 61.20% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 32m 28s. Estimated total time: 15h 34m 58s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 29s, 500 more iterations: 7h 47m 29s.
[2025-08-20 10:13:00,640][__main__][INFO] - Starting iteration 129.
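Note: the end-of-iteration summary can be reproduced from the generation and training wall times plus a running average of iteration duration. At roughly 55 s per iteration, the ~15 h total estimates imply a run of about 1,000 iterations; the sketch below takes that as an assumption.

    def iteration_summary(i, gen_s, train_s, avg_iter_s, total_iters=1000):
        """Format the timing line (sketch; total_iters inferred from the log)."""
        def hms(sec):
            sec = int(sec)
            return f"{sec // 3600}h {sec % 3600 // 60}m {sec % 60}s"
        took = gen_s + train_s
        return (
            f"Iteration {i} took {took:.0f}s "
            f"({100 * gen_s / took:.2f}% Gen, {100 * train_s / took:.2f}% Train). "
            f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s. "
            f"Estimated remaining time: {hms(avg_iter_s * (total_iters - i))}. "
            f"Estimated total time: {hms(avg_iter_s * total_iters)}. "
            f"Time estimates for 10 more iterations: {hms(10 * avg_iter_s)}, "
            f"100 more iterations: {hms(100 * avg_iter_s)}, "
            f"500 more iterations: {hms(500 * avg_iter_s)}."
        )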
[2025-08-20 10:13:24,045][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:13:24,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:13:24,053][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:13:26,512][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:13:26,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:13:26,519][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:13:26,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:13:26,522][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:13:26,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:27,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:28,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:29,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:29,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:30,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:31,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:32,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:33,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:33,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:34,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:35,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:36,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:37,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:37,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:38,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:40,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:40,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:41,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:42,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:43,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:44,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:44,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:45,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:46,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:47,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:47,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:48,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:49,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:50,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:51,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:51,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:13:53,657][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:13:54,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:13:54,605][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:13:55,999][__main__][INFO] - Iteration 130 took 55s (37.86% Gen, 62.14% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 19m 13s. Estimated total time: 15h 22m 38s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 19s.
[2025-08-20 10:13:56,000][__main__][INFO] - Starting iteration 130.
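Note: each iteration runs "Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates" twice, then exchanges "advantage packets" between the two trainers. The run name (ipd_prev_ad_align) and the trainer_ad_align module point at an Advantage Alignment style opponent-shaping objective, in which an agent's policy gradient is shaped by the product of its own past advantages and the co-player's current advantage. A heavily hedged sketch of that shaping term follows; the exact formula used by mllm is not visible in this log, and all names here are illustrative.

    import torch

    def aligned_advantage(own_adv, opp_adv, gamma=0.96, beta=1.0):
        """Advantage-alignment shaping (sketch). own_adv/opp_adv: (batch, T)
        per-step advantages for this agent and the co-player; the received
        'advantage packets' presumably carry opp_adv."""
        past = torch.zeros_like(own_adv)  # discounted sum of own past advantages
        acc = torch.zeros(own_adv.shape[0], dtype=own_adv.dtype, device=own_adv.device)
        for t in range(own_adv.shape[1]):
            past[:, t] = acc
            acc = gamma * acc + own_adv[:, t]
        # alignment: up-weight steps where past own and current opponent
        # advantages point the same way
        return own_adv + beta * past * opp_adv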
[2025-08-20 10:14:19,090][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:14:19,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:14:19,098][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:14:21,532][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:14:21,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:14:21,540][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:14:21,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:14:21,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:14:21,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:22,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:23,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:24,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:25,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:25,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:26,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:27,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:28,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:28,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:29,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:30,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:31,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:32,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:32,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:33,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:34,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:35,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:36,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:37,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:38,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:38,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:39,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:40,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:41,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:42,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:42,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:43,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:44,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:45,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:46,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:46,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:14:48,541][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:14:49,523][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:14:49,525][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:14:50,862][__main__][INFO] - Iteration 131 took 54s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 10m 1s. Estimated total time: 15h 14m 21s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 26s, 500 more iterations: 7h 37m 10s.
[2025-08-20 10:14:50,864][__main__][INFO] - Starting iteration 131.
[2025-08-20 10:15:14,212][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:15:14,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:15:14,220][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:15:16,664][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:15:16,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:15:16,672][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:15:16,674][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:15:16,675][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:15:16,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:17,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:18,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:19,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:20,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:20,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:22,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:23,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:24,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:24,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:25,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:26,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:27,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:28,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:29,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:31,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:31,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:32,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:33,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:34,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:35,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:35,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:36,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:37,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:38,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:39,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:39,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:40,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:41,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:42,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:43,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:43,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:15:45,412][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:15:46,439][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:15:46,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:15:47,694][__main__][INFO] - Iteration 132 took 56s (36.79% Gen, 63.20% Train). Generation: 20s, Training: 35s. Estimated remaining time: 13h 41m 52s. Estimated total time: 15h 47m 9s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 42s, 500 more iterations: 7h 53m 34s.
[2025-08-20 10:15:47,696][__main__][INFO] - Starting iteration 132.
[2025-08-20 10:16:10,784][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:16:10,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:16:10,791][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:16:13,223][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:16:13,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:16:13,232][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:16:13,234][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:16:13,234][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:16:13,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:14,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:15,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:15,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:16,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:17,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:18,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:19,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:19,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:20,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:21,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:22,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:23,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:23,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:24,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:25,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:26,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:27,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:27,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:28,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:29,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:30,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:31,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:32,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:33,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:34,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:35,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:36,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:36,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:37,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:38,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:42,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:16:43,965][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:30, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:16:44,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:16:44,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:16:46,534][__main__][INFO] - Iteration 133 took 58s (35.10% Gen, 64.90% Train). Generation: 20s, Training: 38s. Estimated remaining time: 14h 14m 22s. Estimated total time: 16h 20m 38s. Time estimates for 10 more iterations: 9m 48s, 100 more iterations: 1h 38m 3s, 500 more iterations: 8h 10m 19s.
[2025-08-20 10:16:46,536][__main__][INFO] - Starting iteration 133.
[2025-08-20 10:17:09,890][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:17:09,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:17:09,897][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:17:12,363][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:17:12,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:17:12,370][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:17:12,372][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:17:12,373][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:17:12,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:13,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:14,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:15,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:15,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:16,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:17,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:18,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:19,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:19,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:20,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:21,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:22,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:23,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:23,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:24,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:25,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:26,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:27,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:27,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:28,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:29,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:30,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:31,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:32,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:33,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:33,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:34,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:35,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:36,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:37,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:37,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:17:39,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:17:40,325][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:17:40,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:17:41,628][__main__][INFO] - Iteration 134 took 55s (37.91% Gen, 62.08% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 11m 0s. Estimated total time: 15h 18m 11s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 5s.
[2025-08-20 10:17:41,630][__main__][INFO] - Starting iteration 134.
[2025-08-20 10:18:04,661][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:18:04,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:18:04,669][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:18:07,121][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:18:07,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:18:07,129][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:18:07,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:18:07,132][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:18:07,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:08,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:09,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:09,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:10,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:11,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:12,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:12,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:13,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:14,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:15,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:16,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:16,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:17,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:18,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:19,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:20,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:20,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:21,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:22,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:23,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:24,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:24,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:26,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:27,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:27,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:28,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:29,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:30,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:31,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:31,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:32,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:18:34,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:18:35,196][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:18:35,198][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:18:36,535][__main__][INFO] - Iteration 135 took 54s (37.49% Gen, 62.51% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 6m 56s. Estimated total time: 15h 15m 2s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 31s.
[2025-08-20 10:18:36,537][__main__][INFO] - Starting iteration 135.
[2025-08-20 10:18:59,710][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:18:59,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:18:59,718][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:19:02,169][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:19:02,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:19:02,177][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:19:02,180][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:19:02,180][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:19:02,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:03,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:04,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:04,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:05,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:06,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:07,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:08,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:08,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:09,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:10,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:11,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:12,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:12,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:13,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:14,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:19:15,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:16,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:16,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:17,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:18,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:19,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:20,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:21,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:21,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:22,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:23,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:24,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:25,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:25,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:26,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:27,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:19:29,167][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:19:30,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:19:30,143][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:19:31,608][__main__][INFO] - Iteration 136 took 55s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 8m 50s. Estimated total time: 15h 17m 51s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 55s. [2025-08-20 10:19:31,610][__main__][INFO] - Starting iteration 136. 
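Each iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." records between "Receiving advantage packets." and "Apply reinforce step", so the REINFORCE update is taken over 32 micro-batches of 8,960 action tokens each (about 286,720 tokens per optimizer step). A minimal sketch of that accumulate-then-step pattern; the batch field names (`input_ids`, `loss_mask`, `advantages`, ...) are hypothetical, not taken from the codebase:

```python
import torch

ACCUM_STEPS = 32  # matches the 32 accumulation records per iteration

def reinforce_update(policy, optimizer, micro_batches):
    """Accumulate the policy-gradient loss over micro-batches, then step once."""
    optimizer.zero_grad(set_to_none=True)
    for batch in micro_batches:  # len(micro_batches) == ACCUM_STEPS
        out = policy(batch["input_ids"], attention_mask=batch["attention_mask"])
        logprobs = torch.log_softmax(out.logits, dim=-1)
        picked = logprobs.gather(-1, batch["labels"].unsqueeze(-1)).squeeze(-1)
        mask = batch["loss_mask"]          # 8,960 action tokens per micro-batch
        n_tokens = mask.sum()
        # REINFORCE surrogate: -E[A * log pi(a|s)], averaged over action tokens
        loss = -(batch["advantages"] * picked * mask).sum() / n_tokens
        (loss / ACCUM_STEPS).backward()    # accumulate gradients
    optimizer.step()                       # the "Apply reinforce step"
```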
[2025-08-20 10:20:00,866][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:00,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:00,874][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:03,326][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:03,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:03,334][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:03,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:20:03,337][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:20:03,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:04,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:05,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:06,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:06,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:07,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:08,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:09,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:09,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:10,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:11,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:12,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:13,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:13,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:14,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:15,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:20:16,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:17,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:17,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:19,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:19,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:20,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:21,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:22,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:23,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:23,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:24,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:25,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:26,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:27,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:27,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:28,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:30,257][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:20:31,229][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:20:31,230][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:20:32,693][__main__][INFO] - Iteration 137 took 1m 1s (43.85% Gen, 56.14% Train). Generation: 26s, Training: 34s. Estimated remaining time: 14h 48m 1s. Estimated total time: 16h 58m 3s. Time estimates for 10 more iterations: 10m 10s, 100 more iterations: 1h 41m 48s, 500 more iterations: 8h 29m 1s. [2025-08-20 10:20:32,698][__main__][INFO] - Starting iteration 137. 
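"Sharing advantage alignment data." followed immediately by "Receiving advantage packets." suggests the two adapters' trainers swap per-timestep advantages before each policy update. The transport is not visible in the log; a minimal sketch using an in-process queue, with the packet layout as an assumption:

```python
import queue
from dataclasses import dataclass

import torch

@dataclass
class AdvantagePacket:
    """Hypothetical payload: one trainer's advantages, aligned by timestep."""
    iteration: int
    advantages: torch.Tensor  # shape: (episodes, timesteps)

def share_and_receive(outbox: queue.Queue, inbox: queue.Queue,
                      packet: AdvantagePacket) -> AdvantagePacket:
    outbox.put(packet)   # "Sharing advantage alignment data."
    other = inbox.get()  # "Receiving advantage packets."
    assert other.iteration == packet.iteration
    return other
```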
[2025-08-20 10:20:55,819][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:55,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:55,827][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:58,286][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:58,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:58,294][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:20:58,297][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:20:58,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:20:58,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:20:59,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:00,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:00,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:01,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:02,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:03,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:04,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:04,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:05,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:06,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:07,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:08,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:08,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:09,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:10,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:21:11,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:12,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:12,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:14,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:14,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:15,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:16,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:17,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:18,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:18,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:19,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:20,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:21,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:22,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:22,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:23,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:25,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:21:26,328][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:21:26,329][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:21:27,606][__main__][INFO] - Iteration 138 took 54s (37.63% Gen, 62.37% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 4m 7s. Estimated total time: 15h 15m 4s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 32s. [2025-08-20 10:21:27,608][__main__][INFO] - Starting iteration 138. 
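The time projections in these summary lines are linear extrapolation from the running mean iteration time: at about 54.9 s per iteration, 10 iterations come to 9m 9s, 100 to 1h 31m 30s, and 500 to 7h 37m 30s, matching the logged figures to within a couple of seconds. Single slow iterations inflate the estimate temporarily (iteration 137 above took 1m 1s with 26 s of generation, pushing the total estimate to 16h 58m; one iteration later it relaxes back to about 15h 15m), and the roughly 13 h remaining at 55 s per iteration is consistent with a run of around 1,000 iterations in total. A sketch of the projection arithmetic:

```python
def project(mean_iter_s: float, n: int) -> str:
    """Format n iterations' worth of time at the current mean iteration cost."""
    total = round(mean_iter_s * n)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

mean_iter_s = 549 / 10  # "10 more iterations: 9m 9s" => 54.9 s/iteration
for n in (10, 100, 500):
    print(n, project(mean_iter_s, n))
# 10 -> 9m 9s, 100 -> 1h 31m 30s, 500 -> 7h 37m 30s
```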
[2025-08-20 10:21:51,244][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:21:51,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:21:51,252][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:21:53,699][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:21:53,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:21:53,707][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:21:53,710][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:21:53,710][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:21:54,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:54,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:55,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:56,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:57,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:57,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:58,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:21:59,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:00,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:01,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:01,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:02,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:03,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:04,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:05,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:05,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:22:06,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:07,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:08,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:09,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:09,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:10,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:11,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:12,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:13,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:13,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:15,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:15,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:16,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:17,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:18,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:19,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:20,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:22:21,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:22:21,654][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:22:22,980][__main__][INFO] - Iteration 139 took 55s (38.28% Gen, 61.71% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 10m 59s. Estimated total time: 15h 22m 51s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 25s. [2025-08-20 10:22:22,982][__main__][INFO] - Starting iteration 139. 
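Both optimizer states are rewritten every iteration to the same fixed paths under seed_1000/agent_trainer/, so a preempted job can resume from the last completed step. (Note the indexing quirk visible throughout: the completion message "Iteration N took ..." is 1-indexed while "Starting iteration N" is 0-indexed, so "Iteration 139 took 55s" and the preceding "Starting iteration 138" refer to the same step.) A minimal save/restore sketch using standard torch serialization; the function names are hypothetical:

```python
import torch

def save_optimizers(policy_opt, critic_opt, ckpt_dir: str) -> None:
    # Overwrites the same files each iteration, as in the log.
    torch.save(policy_opt.state_dict(), f"{ckpt_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{ckpt_dir}/critic_optimizer_state.pt")

def load_optimizers(policy_opt, critic_opt, ckpt_dir: str) -> None:
    policy_opt.load_state_dict(torch.load(f"{ckpt_dir}/policy_optimizer_state.pt"))
    critic_opt.load_state_dict(torch.load(f"{ckpt_dir}/critic_optimizer_state.pt"))
```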
[2025-08-20 10:22:46,901][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:22:46,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:22:46,908][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:22:49,351][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:22:49,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:22:49,358][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:22:49,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:22:49,361][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:22:49,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:50,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:51,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:52,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:52,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:53,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:54,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:55,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:55,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:56,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:57,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:58,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:59,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:22:59,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:00,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:01,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:23:02,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:03,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:03,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:04,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:05,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:06,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:07,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:08,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:09,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:09,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:10,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:11,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:12,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:13,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:13,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:14,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:16,250][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:23:17,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:23:17,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:23:18,611][__main__][INFO] - Iteration 140 took 55s (38.60% Gen, 61.39% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 14m 21s. Estimated total time: 15h 27m 9s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 42s, 500 more iterations: 7h 43m 34s. [2025-08-20 10:23:18,613][__main__][INFO] - Starting iteration 140. 
[2025-08-20 10:23:42,523][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:23:42,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:23:42,531][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:23:45,037][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:23:45,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:23:45,045][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:23:45,047][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:23:45,047][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:23:45,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:46,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:46,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:47,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:48,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:49,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:50,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:50,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:51,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:52,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:53,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:54,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:54,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:55,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:56,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:57,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:23:58,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:58,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:23:59,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:00,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:01,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:02,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:03,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:04,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:04,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:05,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:06,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:07,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:08,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:08,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:09,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:10,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:12,006][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:24:12,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:24:12,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:24:14,422][__main__][INFO] - Iteration 141 took 55s (38.37% Gen, 61.63% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 16m 25s. Estimated total time: 15h 30m 9s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 4s. [2025-08-20 10:24:14,424][__main__][INFO] - Starting iteration 141. 
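The "Create alternative trajectory batch" / "Compute alternative advantage estimates" pair runs twice per iteration before the exchange, which fits the advantage-alignment family of opponent-shaping methods: each agent's policy-gradient weight couples the opponent's advantage at step t with a discounted sum of the agent's own past advantages. The exact weighting used by trainer_ad_align is not visible in the log; a sketch under that assumption, with `gamma` and `beta` as hypothetical hyperparameters:

```python
import torch

def ad_align_weights(a_self: torch.Tensor, a_opp: torch.Tensor,
                     gamma: float = 0.96, beta: float = 1.0) -> torch.Tensor:
    """w_t = A_self_t + beta * (sum_{k<t} gamma^(t-k) * A_self_k) * A_opp_t.

    a_self, a_opp: (episodes, T) advantage tensors aligned by timestep.
    """
    _, T = a_self.shape
    past = torch.zeros_like(a_self[:, 0])  # discounted sum of own past advantages
    w = torch.empty_like(a_self)
    for t in range(T):
        w[:, t] = a_self[:, t] + beta * past * a_opp[:, t]
        past = gamma * (past + a_self[:, t])
    return w
```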
[2025-08-20 10:24:40,131][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:24:40,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:24:40,139][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:24:42,598][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:24:42,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:24:42,605][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:24:42,608][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:24:42,608][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:24:42,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:43,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:44,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:45,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:46,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:46,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:47,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:48,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:49,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:50,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:50,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:51,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:52,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:53,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:54,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:54,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:24:55,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:56,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:57,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:58,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:58,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:24:59,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:00,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:01,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:02,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:03,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:03,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:04,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:05,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:06,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:07,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:07,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:09,536][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:25:10,485][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:25:10,487][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:25:11,788][__main__][INFO] - Iteration 142 took 57s (40.55% Gen, 59.45% Train). Generation: 23s, Training: 34s. Estimated remaining time: 13h 41m 23s. Estimated total time: 15h 56m 4s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 36s, 500 more iterations: 7h 58m 2s. [2025-08-20 10:25:11,790][__main__][INFO] - Starting iteration 142. 
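Every "Apply reinforce step" in this stretch reports ΔVRAM Allocated of 2.288818359375e-05 GB with zero change in reserved memory; that delta is exactly 24 KiB, so allocation is at steady state across 140+ iterations with no sign of a leak. The conversion:

```python
delta_gb = 2.288818359375e-05      # as reported by the tracker
print(delta_gb * 1024**3)          # 24576.0 bytes
print(delta_gb * 1024**3 / 1024)   # 24.0 KiB per reinforce step
```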
[2025-08-20 10:25:34,948][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:25:34,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:25:34,955][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:25:37,413][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:25:37,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:25:37,420][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:25:37,423][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:25:37,423][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:25:37,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:38,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:39,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:40,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:40,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:41,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:42,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:43,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:44,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:44,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:45,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:46,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:47,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:48,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:48,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:49,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:25:50,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:51,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:52,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:53,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:54,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:54,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:55,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:56,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:57,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:58,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:58,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:25:59,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:00,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:01,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:02,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:02,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:04,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:26:05,326][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:26:05,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:26:06,660][__main__][INFO] - Iteration 143 took 54s (37.74% Gen, 62.26% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 58m 54s. Estimated total time: 15h 14m 30s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 15s. [2025-08-20 10:26:06,662][__main__][INFO] - Starting iteration 143. 
[2025-08-20 10:26:30,902][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:26:30,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:26:30,909][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:26:33,414][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:26:33,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:26:33,422][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:26:33,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:26:33,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:26:33,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:34,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:35,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:36,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:36,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:37,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:38,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:39,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:40,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:40,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:41,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:42,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:43,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:44,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:44,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:45,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:26:46,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:47,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:48,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:48,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:49,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:50,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:51,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:52,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:53,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:54,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:54,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:55,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:56,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:57,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:58,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:26:58,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:00,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:27:01,373][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:27:01,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:27:02,625][__main__][INFO] - Iteration 144 took 55s (38.84% Gen, 61.15% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 16m 10s. Estimated total time: 15h 32m 42s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 16s, 500 more iterations: 7h 46m 21s. [2025-08-20 10:27:02,626][__main__][INFO] - Starting iteration 144. 
[2025-08-20 10:27:25,829][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:27:25,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:27:25,837][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:27:28,304][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:27:28,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:27:28,311][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:27:28,314][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:27:28,315][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:27:28,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:29,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:30,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:31,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:32,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:33,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:33,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:34,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:35,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:36,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:37,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:37,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:38,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:39,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:40,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:41,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:27:41,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:42,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:43,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:44,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:44,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:45,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:46,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:47,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:48,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:48,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:49,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:50,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:51,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:52,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:53,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:27:57,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:00,495][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:32, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:28:01,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:28:01,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:28:02,972][__main__][INFO] - Iteration 145 took 1m 0s (34.37% Gen, 65.63% Train). Generation: 20s, Training: 39s. Estimated remaining time: 14h 28m 13s. Estimated total time: 16h 45m 45s. Time estimates for 10 more iterations: 10m 3s, 100 more iterations: 1h 40m 34s, 500 more iterations: 8h 22m 52s. [2025-08-20 10:28:02,974][__main__][INFO] - Starting iteration 145. 
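Each iteration above accumulates the policy gradient loss over 32 micro-batches of 8,960 tokens before a single "Apply reinforce step". A self-contained sketch of that cadence with a stand-in linear policy head and fake data; the real model, batching, and loss details are not visible in this log:

```python
import torch

model = torch.nn.Linear(16, 2)                       # stand-in policy head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
TOKENS_PER_MICRO_BATCH, N_MICRO_BATCHES = 8960, 32   # matches the log: 32 x 8960

optimizer.zero_grad()
total_tokens = 0
for _ in range(N_MICRO_BATCHES):
    x = torch.randn(TOKENS_PER_MICRO_BATCH, 16)      # fake token features
    advantages = torch.randn(TOKENS_PER_MICRO_BATCH) # fake per-token advantages
    logprobs = torch.log_softmax(model(x), dim=-1)[:, 0]
    loss = -(advantages * logprobs).sum()            # REINFORCE surrogate (summed)
    loss.backward()                                  # gradients accumulate across calls
    total_tokens += TOKENS_PER_MICRO_BATCH
    print(f"Accumulated the policy gradient loss for {TOKENS_PER_MICRO_BATCH} tokens.")

for p in model.parameters():                         # per-token average before stepping
    p.grad /= total_tokens
optimizer.step()                                     # one "Apply reinforce step"
```

Accumulating summed losses and normalizing once keeps the effective batch at 32 × 8,960 tokens without ever materializing it in memory, which is consistent with the flat VRAM readings during accumulation.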
[2025-08-20 10:28:26,102][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:28:26,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:28:26,109][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:28:28,560][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:28:28,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:28:28,567][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:28:28,570][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:28:28,570][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:28:28,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:29,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:30,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:31,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:32,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:32,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:33,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:34,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:35,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:36,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:36,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:37,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:38,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:39,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:39,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:40,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:28:41,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:42,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:43,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:43,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:45,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:45,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:46,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:47,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:48,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:49,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:49,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:50,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:51,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:52,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:53,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:53,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:28:55,475][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:28:56,443][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:28:56,445][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:28:57,747][__main__][INFO] - Iteration 146 took 54s (37.75% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 54m 26s. Estimated total time: 15h 12m 52s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 17s, 500 more iterations: 7h 36m 26s. [2025-08-20 10:28:57,749][__main__][INFO] - Starting iteration 146. 
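After every reinforce step the trainer persists both optimizer states (the two "Saved ... optimizer state" lines), so a preempted run can resume with intact Adam moments rather than cold-started ones. A sketch of that save/restore pair; the actual trainer_common logic is not shown here:

```python
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir):
    """Persist optimizer states so a run can resume mid-training (sketch)."""
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(policy_opt.state_dict(), policy_path)
    torch.save(critic_opt.state_dict(), critic_path)
    return policy_path, critic_path

# Resuming later is the inverse operation:
#   policy_opt.load_state_dict(torch.load(policy_path))
#   critic_opt.load_state_dict(torch.load(critic_path))
```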
[2025-08-20 10:29:21,400][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:29:21,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:29:21,408][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:29:23,876][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:29:23,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:29:23,884][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:29:23,886][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:29:23,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:29:24,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:24,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:25,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:26,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:27,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:28,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:28,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:29,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:30,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:31,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:32,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:32,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:33,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:34,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:35,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:36,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:29:36,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:37,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:38,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:39,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:40,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:41,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:42,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:42,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:43,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:44,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:45,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:46,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:46,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:47,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:48,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:49,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:29:50,858][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:29:51,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:29:51,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:29:53,133][__main__][INFO] - Iteration 147 took 55s (38.23% Gen, 61.77% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 3m 41s. Estimated total time: 15h 23m 4s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 32s. [2025-08-20 10:29:53,135][__main__][INFO] - Starting iteration 147. 
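The "Get advantages with critic gradient accumulation" task pairs two jobs: producing advantage estimates for the policy loss and accumulating a regression loss that trains the critic. A generic one-step sketch of that pairing; the estimator trainer_ad_align actually uses is not shown in this log:

```python
import torch

def advantages_and_critic_loss(values, rewards, gamma=0.99):
    """One-step advantages plus a critic regression loss (generic sketch).

    values:  (T+1,) critic predictions V(s_0..s_T), requires_grad=True
    rewards: (T,)   per-step rewards
    """
    targets = rewards + gamma * values[1:].detach()  # bootstrapped targets
    advantages = targets - values[:-1]               # A_t = r_t + γV(s_{t+1}) − V(s_t)
    critic_loss = advantages.pow(2).mean()           # MSE pulls V toward targets
    # Calling critic_loss.backward() per micro-batch lets critic gradients
    # accumulate alongside the policy gradients, as the task name suggests.
    return advantages.detach(), critic_loss
```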
[2025-08-20 10:30:16,421][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:30:16,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:30:16,428][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:30:18,907][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:30:18,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:30:18,914][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:30:18,917][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:30:18,918][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:30:19,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:20,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:20,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:21,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:22,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:23,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:23,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:24,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:25,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:26,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:27,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:27,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:28,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:29,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:30,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:31,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:30:31,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:32,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:33,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:34,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:35,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:36,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:37,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:37,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:38,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:39,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:40,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:41,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:41,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:42,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:43,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:44,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:30:45,895][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:30:46,900][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:30:46,902][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:30:48,203][__main__][INFO] - Iteration 148 took 55s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 57m 30s. Estimated total time: 15h 17m 47s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 53s. [2025-08-20 10:30:48,204][__main__][INFO] - Starting iteration 148. 
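The "Sharing advantage alignment data." / "Receiving advantage packets." pair suggests the two players' trainers exchange advantage estimates before building their policy losses. A loose sketch of such an exchange over multiprocessing queues; the real transport and packet schema behind these log lines are not visible here:

```python
import torch
import torch.multiprocessing as mp

def exchange_advantages(own_adv, out_q, in_q):
    """Swap advantage tensors with the opponent's trainer (hypothetical)."""
    out_q.put(own_adv.detach().cpu())   # "Sharing advantage alignment data."
    opp_adv = in_q.get()                # "Receiving advantage packets."
    return opp_adv.to(own_adv.device)
```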
[2025-08-20 10:31:13,428][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:31:13,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:31:13,435][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:31:15,904][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:31:15,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:31:15,912][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:31:15,914][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:31:15,915][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:31:16,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:17,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:17,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:19,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:22,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:23,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:23,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:24,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:25,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:26,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:27,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:27,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:28,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:29,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:30,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:31,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:31:31,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:32,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:33,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:34,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:34,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:35,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:37,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:37,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:38,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:39,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:40,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:41,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:41,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:42,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:43,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:44,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:31:45,878][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:29, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:31:46,857][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:31:46,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:31:48,140][__main__][INFO] - Iteration 149 took 59s (37.98% Gen, 62.02% Train). Generation: 22s, Training: 37s. Estimated remaining time: 14h 17m 38s. Estimated total time: 16h 38m 55s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 53s, 500 more iterations: 8h 19m 27s. [2025-08-20 10:31:48,141][__main__][INFO] - Starting iteration 149. 
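The steady-state ΔVRAM on "Apply reinforce step" is vanishingly small once optimizer state exists: the reported 2.288818359375e-05 GB converts to exactly 24 KiB, i.e. the update allocates essentially nothing new.

```python
>>> 2.288818359375e-05 * 2**30   # the reported ΔVRAM, converted GB -> bytes
24576.0
```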
[2025-08-20 10:32:11,327][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:32:11,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:32:11,334][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:32:13,793][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:32:13,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:32:13,801][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:32:13,803][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:32:13,804][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:32:14,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:14,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:15,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:16,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:17,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:18,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:18,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:19,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:20,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:21,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:22,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:22,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:23,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:24,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:25,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:26,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:32:26,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:27,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:28,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:29,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:30,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:31,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:32,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:32,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:33,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:34,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:35,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:36,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:36,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:37,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:38,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:39,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:32:40,781][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:32:41,742][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:32:41,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:32:43,036][__main__][INFO] - Iteration 150 took 54s (37.75% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 52m 42s. Estimated total time: 15h 14m 54s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 27s. [2025-08-20 10:32:43,037][__main__][INFO] - Starting iteration 150. 
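Iteration 150's projections are mutually consistent: each estimate is a linear extrapolation of a ≈54.9 s/iteration running average, and the estimated total matches a 1000-iteration run. A quick check, with the inputs taken from the summary line above (the 1000-iteration target is an inference from these numbers, not stated in the log):

```python
avg_s = 27447 / 500     # "500 more iterations: 7h 37m 27s" = 27447 s
print(avg_s)            # 54.894 s per iteration
print(avg_s * 10)       # 548.9 s  ≈ "10 more iterations: 9m 8s"
print(avg_s * 100)      # 5489.4 s ≈ "100 more iterations: 1h 31m 29s"
print(avg_s * 1000)     # 54894 s  = 15h 14m 54s, the "Estimated total time"
```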
[2025-08-20 10:33:06,190][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:33:06,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:33:06,198][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:33:08,629][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:33:08,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:33:08,637][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:33:08,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:33:08,640][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:33:08,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:09,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:10,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:11,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:12,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:12,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:13,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:14,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:15,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:16,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:16,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:17,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:18,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:19,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:20,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:20,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:33:21,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:22,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:23,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:24,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:25,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:26,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:26,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:27,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:28,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:29,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:30,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:30,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:31,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:32,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:33,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:34,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:33:35,629][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:33:36,576][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:33:36,577][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:33:41,010][__main__][INFO] - Iteration 151 took 57s (35.75% Gen, 58.94% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13h 43m 2s. Estimated total time: 16h 6m 12s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 37s, 500 more iterations: 8h 3m 6s. [2025-08-20 10:33:41,012][__main__][INFO] - Starting iteration 151. 
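Unlike its neighbors, iteration 151's Gen and Train shares do not sum to 100%. The residual is per-iteration work attributed to neither phase; the several-second pause between the optimizer save at 10:33:36 and the summary at 10:33:41 above is consistent with it.

```python
>>> 100 - (35.75 + 58.94)   # iteration 151: share outside both Gen and Train
5.31
>>> 0.0531 * 57             # ≈3 s of the 57 s iteration, roughly matching the gap
3.0267
```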
[2025-08-20 10:34:04,744][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:34:04,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:34:04,752][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:34:07,217][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:34:07,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:34:07,225][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:34:07,227][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:34:07,227][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:34:07,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:08,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:09,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:09,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:10,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:11,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:12,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:13,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:13,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:14,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:15,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:16,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:17,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:17,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:18,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:19,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:34:20,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:21,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:22,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:23,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:23,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:24,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:25,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:26,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:27,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:27,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:28,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:29,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:30,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:31,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:31,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:32,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:34:34,195][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:34:35,235][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:34:35,237][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:34:36,593][__main__][INFO] - Iteration 152 took 55s (38.25% Gen, 61.75% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 2m 15s. Estimated total time: 15h 26m 20s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 38s, 500 more iterations: 7h 43m 10s. [2025-08-20 10:34:36,595][__main__][INFO] - Starting iteration 152. 
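Combined with the accumulation pattern noted earlier, each update consumes 32 × 8,960 = 286,720 tokens; against the ≈34 s train phase logged here that is roughly 8.4k tokens/s of backward-plus-update work. This is a back-of-envelope figure derived from the log, not a metric the trainer reports:

```python
tokens_per_update = 32 * 8960     # micro-batches per iteration × tokens each
print(tokens_per_update)          # 286720 tokens per reinforce step
print(tokens_per_update / 34)     # ≈ 8433 tokens/s during the 34 s train phase
```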
[2025-08-20 10:34:59,808][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:34:59,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:34:59,815][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:35:02,285][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:35:02,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:35:02,292][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 10:35:02,294][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 10:35:02,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 10:35:02,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:03,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:04,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:04,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:05,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:06,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:07,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:08,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:08,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:09,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:10,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:11,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:12,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:12,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:13,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:14,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 10:35:15,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:16,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:16,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:17,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:18,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:19,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:20,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:21,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:22,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:22,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:23,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:24,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:25,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:26,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:26,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:27,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 10:35:29,285][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 10:35:30,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 10:35:30,244][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 10:35:31,545][__main__][INFO] - Iteration 153 took 54s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 50m 49s. Estimated total time: 15h 15m 49s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 34s, 500 more iterations: 7h 37m 54s. [2025-08-20 10:35:31,546][__main__][INFO] - Starting iteration 153. 
[2025-08-20 10:35:55,181][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:35:55,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:35:55,189][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:35:57,641][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:35:57,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:35:57,649][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:35:57,651][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:35:57,652][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:35:57,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:35:58,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:35:59,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:00,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:01,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:01,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:02,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:03,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:04,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:05,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:05,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:06,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:07,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:08,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:09,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:09,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:10,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:11,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:12,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:13,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:14,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:15,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:15,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:16,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:17,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:18,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:19,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:19,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:20,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:21,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:22,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:23,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:24,588][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:36:25,577][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:36:25,579][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:36:26,984][__main__][INFO] - Iteration 154 took 55s (38.21% Gen, 61.79% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 58m 1s. Estimated total time: 15h 23m 57s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 58s.
[2025-08-20 10:36:26,986][__main__][INFO] - Starting iteration 154.
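
Note: each training phase logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." records (~0.8 s apart) before one "Apply reinforce step", i.e. gradients are accumulated over 32 micro-batches of 8,960 tokens (≈286,720 tokens) per optimizer step. A minimal sketch of that pattern, assuming PyTorch and a HuggingFace-style policy model; the function and field names (chunk["input_ids"], chunk["advantages"]) are assumptions, not the actual mllm code:

    import torch
    import torch.nn.functional as F

    TOKENS_PER_CHUNK = 8960   # token count in every "Accumulated ..." record
    CHUNKS_PER_STEP = 32      # accumulation records logged per reinforce step

    def reinforce_step(policy, optimizer, chunks, logger):
        """Accumulate the policy-gradient loss over token chunks, then step once."""
        optimizer.zero_grad(set_to_none=True)
        for chunk in chunks:
            # Re-score the sampled tokens under the current policy (next-token shift).
            logits = policy(chunk["input_ids"]).logits[:, :-1]
            logp = F.log_softmax(logits, dim=-1)
            taken = logp.gather(-1, chunk["input_ids"][:, 1:, None]).squeeze(-1)
            # REINFORCE surrogate: advantage-weighted negative log-likelihood,
            # normalized by the total token count of the whole optimizer step.
            # chunk["advantages"] is assumed aligned with the shifted targets.
            loss = -(chunk["advantages"] * taken).sum()
            (loss / (TOKENS_PER_CHUNK * CHUNKS_PER_STEP)).backward()  # grads accumulate
            logger.info(
                f"Accumulated the policy gradient loss for {taken.numel()} tokens."
            )
        optimizer.step()  # logged above as "For task: Apply reinforce step"
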
[2025-08-20 10:36:50,386][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:36:50,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:36:50,394][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:36:52,867][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:36:52,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:36:52,874][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:36:52,877][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:36:52,877][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:36:53,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:53,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:54,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:55,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:56,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:57,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:57,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:58,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:36:59,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:00,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:01,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:01,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:02,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:03,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:04,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:05,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:05,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:06,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:07,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:08,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:09,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:09,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:11,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:11,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:12,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:13,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:14,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:15,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:15,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:16,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:17,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:18,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:19,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:37:20,832][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:37:20,834][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:37:25,780][__main__][INFO] - Iteration 155 took 58s (35.58% Gen, 64.42% Train). Generation: 20s, Training: 37s. Estimated remaining time: 13h 52m 59s. Estimated total time: 16h 19m 53s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 59s, 500 more iterations: 8h 9m 56s.
[2025-08-20 10:37:25,782][__main__][INFO] - Starting iteration 155.
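
Note: the "Saved main/critic optimizer state to …" pair after every reinforce step indicates the trainer checkpoints both optimizer states to fixed paths each iteration, so an interrupted run can resume with momentum/Adam statistics intact. A minimal sketch of such checkpointing; the function name save_optimizer_states is an assumption, though torch.save on the state_dicts is the standard way to produce .pt files like these:

    import torch

    def save_optimizer_states(policy_opt, critic_opt, trainer_dir, logger):
        """Persist both optimizer states after each reinforce step."""
        policy_path = f"{trainer_dir}/policy_optimizer_state.pt"
        torch.save(policy_opt.state_dict(), policy_path)  # overwrites in place
        logger.info(f"Saved main optimizer state to {policy_path}")
        critic_path = f"{trainer_dir}/critic_optimizer_state.pt"
        torch.save(critic_opt.state_dict(), critic_path)
        logger.info(f"Saved critic optimizer state to {critic_path}")
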
[2025-08-20 10:37:48,979][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:37:48,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:37:48,987][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:37:51,419][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:37:51,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:37:51,426][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:37:51,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:37:51,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:37:51,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:52,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:53,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:54,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:54,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:55,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:56,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:57,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:58,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:58,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:37:59,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:00,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:01,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:02,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:02,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:03,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:04,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:05,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:06,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:06,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:08,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:08,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:09,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:10,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:11,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:12,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:12,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:13,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:14,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:15,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:16,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:16,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:18,450][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:38:19,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:38:19,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:38:20,955][__main__][INFO] - Iteration 156 took 55s (37.63% Gen, 62.37% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 51m 42s. Estimated total time: 15h 19m 32s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 57s, 500 more iterations: 7h 39m 46s.
[2025-08-20 10:38:20,956][__main__][INFO] - Starting iteration 156.
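
Note: the per-iteration summaries extrapolate linearly from the running mean iteration time. For iteration 156 above, the 500-iteration horizon of 7h 39m 46s corresponds to a mean of ≈55.2 s/iteration (500 × 55.17 s ≈ 27,585 s), and the 15h 19m 32s total is consistent with a 1000-iteration run at that rate. A minimal sketch of this arithmetic; the function names and the assumed totals are illustrative, not the actual __main__ code:

    def fmt_hms(seconds: float) -> str:
        """Render seconds as 'Hh Mm Ss' (or 'Mm Ss' under an hour), as in the log."""
        s = int(round(seconds))
        h, rem = divmod(s, 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

    def eta_report(done_iters: int, total_iters: int, mean_iter_s: float) -> str:
        """Linear extrapolation from the running mean iteration time."""
        remaining = (total_iters - done_iters) * mean_iter_s
        return (
            f"Estimated remaining time: {fmt_hms(remaining)}. "
            f"Estimated total time: {fmt_hms(total_iters * mean_iter_s)}. "
            f"Time estimates for 10 more iterations: {fmt_hms(10 * mean_iter_s)}, "
            f"100 more iterations: {fmt_hms(100 * mean_iter_s)}, "
            f"500 more iterations: {fmt_hms(500 * mean_iter_s)}."
        )

    # Roughly reproduces the iteration-156 record (values inferred from the log):
    print(eta_report(done_iters=161, total_iters=1000, mean_iter_s=55.17))
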
[2025-08-20 10:38:44,209][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:38:44,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:38:44,217][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:38:46,653][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:38:46,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:38:46,660][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:38:46,663][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:38:46,663][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:38:46,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:47,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:48,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:49,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:50,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:50,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:51,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:52,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:53,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:54,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:54,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:55,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:56,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:57,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:58,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:58,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:38:59,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:00,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:01,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:02,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:02,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:03,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:04,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:05,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:06,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:07,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:07,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:08,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:09,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:10,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:11,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:11,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:13,533][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:39:14,543][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:39:14,545][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:39:15,994][__main__][INFO] - Iteration 157 took 55s (37.81% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 48m 32s. Estimated total time: 15h 17m 17s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 38s.
[2025-08-20 10:39:15,996][__main__][INFO] - Starting iteration 157.
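
Note: the "Sharing advantage alignment data." / "Receiving advantage packets." pair suggests the two per-agent trainers swap advantage estimates so each policy update can weigh its own advantages against its co-player's, which is the core idea of advantage alignment. A minimal sketch of such an exchange over multiprocessing queues; the queue-based transport, function name, and payload layout are all assumptions, not the actual mllm protocol:

    import torch.multiprocessing as mp

    def exchange_advantages(own_advantages, out_queue: mp.Queue,
                            in_queue: mp.Queue, logger):
        """Swap per-token advantage tensors with the co-player's trainer."""
        logger.info("Sharing advantage alignment data.")
        out_queue.put(own_advantages.detach().cpu())  # ship our estimates
        logger.info("Receiving advantage packets.")
        return in_queue.get()  # block until the other trainer's packet arrives
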
[2025-08-20 10:39:39,623][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:39:39,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:39:39,631][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:39:42,106][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:39:42,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:39:42,114][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:39:42,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:39:42,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:39:42,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:43,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:43,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:44,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:45,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:46,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:47,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:47,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:48,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:49,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:50,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:51,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:51,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:52,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:53,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:54,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:55,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:55,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:56,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:57,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:58,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:59,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:39:59,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:01,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:01,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:02,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:03,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:04,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:05,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:05,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:06,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:07,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:09,105][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:40:10,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:40:10,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:40:11,499][__main__][INFO] - Iteration 158 took 55s (38.14% Gen, 61.85% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 55m 22s. Estimated total time: 15h 25m 3s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 30s, 500 more iterations: 7h 42m 31s.
[2025-08-20 10:40:11,501][__main__][INFO] - Starting iteration 158.
[2025-08-20 10:40:36,187][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:40:36,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:40:36,194][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:40:38,656][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:40:38,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:40:38,663][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:40:38,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:40:38,666][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:40:38,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:39,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:40,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:41,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:42,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:42,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:43,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:44,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:45,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:46,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:46,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:47,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:48,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:49,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:50,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:50,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:51,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:52,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:53,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:54,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:54,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:55,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:56,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:57,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:58,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:40:59,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:00,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:00,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:01,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:02,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:03,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:04,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:05,708][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:41:06,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:41:06,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:41:07,948][__main__][INFO] - Iteration 159 took 56s (39.35% Gen, 60.65% Train). Generation: 22s, Training: 34s. Estimated remaining time: 13h 10m 10s. Estimated total time: 15h 40m 47s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 4s, 500 more iterations: 7h 50m 23s.
[2025-08-20 10:41:07,950][__main__][INFO] - Starting iteration 159.
[2025-08-20 10:41:31,305][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:41:31,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:41:31,312][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:41:33,777][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:41:33,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:41:33,785][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:41:33,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:41:33,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:41:34,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:34,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:35,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:36,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:37,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:38,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:38,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:39,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:40,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:41,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:42,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:42,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:43,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:44,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:45,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:45,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:46,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:47,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:48,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:49,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:49,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:51,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:51,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:52,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:53,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:54,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:55,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:55,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:56,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:57,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:58,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:41:59,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:00,724][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:42:01,720][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:42:01,722][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:42:03,093][__main__][INFO] - Iteration 160 took 55s (37.89% Gen, 62.11% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 47m 30s. Estimated total time: 15h 19m 2s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 31s.
[2025-08-20 10:42:03,094][__main__][INFO] - Starting iteration 160.
[2025-08-20 10:42:26,559][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:42:26,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:42:26,566][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:42:29,010][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:42:29,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:42:29,018][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:42:29,020][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:42:29,020][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:42:29,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:30,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:30,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:31,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:32,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:33,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:34,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:34,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:35,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:36,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:37,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:38,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:38,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:39,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:40,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:41,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:42,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:42,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:43,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:44,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:45,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:46,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:46,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:47,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:48,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:49,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:49,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:50,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:52,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:52,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:53,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:54,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:42:56,004][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:42:56,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:42:56,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:42:58,323][__main__][INFO] - Iteration 161 took 55s (38.08% Gen, 61.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 48m 0s. Estimated total time: 15h 20m 28s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 14s.
[2025-08-20 10:42:58,325][__main__][INFO] - Starting iteration 161.
[2025-08-20 10:43:21,633][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:43:21,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:43:21,641][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:43:24,093][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:43:24,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:43:24,101][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 10:43:24,103][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:43:24,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:43:24,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:25,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:25,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:26,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:27,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:28,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:29,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:29,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:30,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:31,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:32,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:33,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:33,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:34,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:35,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:36,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:37,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:37,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:39,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:39,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:40,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:41,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:42,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:43,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:43,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:44,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:45,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:46,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:47,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:47,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:48,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:49,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 10:43:51,120][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 10:43:52,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:43:52,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:43:53,523][__main__][INFO] - Iteration 162 took 55s (37.79% Gen, 62.21% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 46m 35s. Estimated total time: 15h 19m 57s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 59s, 500 more iterations: 7h 39m 58s.
[2025-08-20 10:43:53,524][__main__][INFO] - Starting iteration 162.
[2025-08-20 10:44:17,176][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:44:17,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:44:17,183][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:44:19,639][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:44:19,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:44:19,647][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:44:19,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:44:19,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:44:19,947–10:44:44,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries collapsed)
[2025-08-20 10:44:46,567][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:44:47,558][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:44:47,560][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:44:48,897][__main__][INFO] - Iteration 163 took 55s (38.29% Gen, 61.71% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 48m 34s. Estimated total time: 15h 22m 51s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 25s.
[2025-08-20 10:44:48,898][__main__][INFO] - Starting iteration 163.
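The "For task: …" entries, with their ΔVRAM and ΔTime fields, read like a resource monitor wrapped around each stage. A minimal sketch under that assumption follows; the task_monitor name and the choice of reserved memory as the percentage base are guesses. One grounded observation: the steady-state allocation delta on the reinforce step, 2.288818359375e-05 GB, is exactly 24 KiB (24 × 1024 bytes ÷ 1024³), so allocations have plateaued rather than leaking.

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def task_monitor(name, logger):
    # Hypothetical reconstruction of the "For task: ..." log lines;
    # not the mllm implementation.
    dev = torch.cuda.current_device()
    alloc0 = torch.cuda.memory_allocated(dev)
    reserved0 = torch.cuda.memory_reserved(dev)
    t0 = time.monotonic()
    try:
        yield
    finally:
        d_alloc = (torch.cuda.memory_allocated(dev) - alloc0) / 1024**3
        d_reserved = (torch.cuda.memory_reserved(dev) - reserved0) / 1024**3
        d_time = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
        total = torch.cuda.get_device_properties(dev).total_memory
        pct = 100.0 * torch.cuda.memory_reserved(dev) / total  # assumed base
        logger.info(
            "For task: %s, ΔVRAM Allocated: %s GB, ΔVRAM Reserved: %s GB, "
            "ΔTime: %s, Percentage of VRAM taken: %s%%",
            name, d_alloc, d_reserved, d_time, pct,
        )
```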
[2025-08-20 10:45:12,147][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:45:12,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:45:12,155][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:45:14,576][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:45:14,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:45:14,583][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:45:14,585][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:45:14,586][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:45:14,886–10:45:42,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries collapsed)
[2025-08-20 10:45:44,378][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:29, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:45:45,341][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:45:45,342][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:45:46,661][__main__][INFO] - Iteration 164 took 57s (36.02% Gen, 63.98% Train). Generation: 20s, Training: 36s. Estimated remaining time: 13h 27m 27s. Estimated total time: 16h 2m 42s. Time estimates for 10 more iterations: 9m 37s, 100 more iterations: 1h 36m 16s, 500 more iterations: 8h 1m 21s.
[2025-08-20 10:45:46,663][__main__][INFO] - Starting iteration 164.
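The timing summaries can be back-solved. Iteration 164 is displayed as "57s", yet "10 more iterations: 9m 37s" is 577 s, so the underlying duration was about 57.7 s, and every horizon (10/100/500 more, remaining, total) is consistent with that unrounded duration multiplied by an iteration count; dividing the estimated total time by it puts the full run at roughly 1,000 iterations. A sketch of the arithmetic, with all names assumed:

```python
from datetime import timedelta

def eta_report(last_iter_seconds: float, iters_left: int) -> dict:
    """Hypothetical reconstruction of the ETA lines: each figure is the most
    recent iteration's unrounded duration scaled by an iteration count."""
    out = {"remaining": timedelta(seconds=round(last_iter_seconds * iters_left))}
    for n in (10, 100, 500):
        out[f"{n} more iterations"] = timedelta(seconds=round(last_iter_seconds * n))
    return out

# Back-solving iteration 164: 57.7 s * 10 = 577 s = 0:09:37,
# matching the "10 more iterations: 9m 37s" figure in the log.
print(eta_report(57.7, 839))
```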
[2025-08-20 10:46:10,451][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:46:10,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:46:10,458][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:46:12,921][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:46:12,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:46:12,928][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:46:12,930][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:46:12,931][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:46:13,229–10:46:40,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries collapsed)
[2025-08-20 10:46:42,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:29, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:46:43,342][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:46:43,344][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:46:44,723][__main__][INFO] - Iteration 165 took 58s (36.73% Gen, 63.27% Train). Generation: 21s, Training: 36s. Estimated remaining time: 13h 31m 25s. Estimated total time: 16h 7m 39s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 45s, 500 more iterations: 8h 3m 49s.
[2025-08-20 10:46:44,724][__main__][INFO] - Starting iteration 165.
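Both optimizer states are rewritten after every update, which makes the run resumable at one-iteration granularity. A minimal sketch of that checkpointing; the file names come from the log, while the function name and the use of state_dict() are assumptions:

```python
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
    # Sketch only: the real trainer may checkpoint additional state
    # (schedulers, RNG, adapter weights) alongside the optimizers.
    for label, fname, opt in (
        ("main", "policy_optimizer_state.pt", policy_opt),
        ("critic", "critic_optimizer_state.pt", critic_opt),
    ):
        path = f"{out_dir}/{fname}"
        torch.save(opt.state_dict(), path)
        logger.info("Saved %s optimizer state to %s", label, path)
```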
[2025-08-20 10:47:07,994][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:47:07,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:47:08,001][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:47:10,460][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:47:10,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:47:10,468][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:47:10,470][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:47:10,471][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:47:10,769–10:47:35,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries collapsed)
[2025-08-20 10:47:37,538][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:47:38,513][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:47:38,514][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:47:40,158][__main__][INFO] - Iteration 166 took 55s (37.55% Gen, 62.45% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 46m 44s. Estimated total time: 15h 23m 53s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 56s.
[2025-08-20 10:47:40,160][__main__][INFO] - Starting iteration 166.
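The "Sharing advantage alignment data." / "Receiving advantage packets." pair indicates that the two players exchange advantage estimates every iteration, as opponent-shaping methods in the advantage-alignment family require. The transport is not visible in the log; the sketch below assumes an in-process queue purely for illustration.

```python
import queue

import torch

# Hypothetical in-process transport between the two agent trainers; the real
# system could equally use files, shared memory, or RPC.
outbox: queue.Queue = queue.Queue()
inbox: queue.Queue = queue.Queue()

def exchange_advantages(own_advantages: torch.Tensor, logger) -> torch.Tensor:
    logger.info("Sharing advantage alignment data.")
    outbox.put(own_advantages.detach().cpu())
    logger.info("Receiving advantage packets.")
    return inbox.get()  # the opponent's advantages, step-aligned with ours
```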
[2025-08-20 10:48:03,363][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:48:03,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:48:03,371][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:48:05,821][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:48:05,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:48:05,828][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:48:05,830][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:48:05,831][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:48:06,130–10:48:31,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries collapsed)
[2025-08-20 10:48:32,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:48:33,929][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:48:33,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:48:35,262][__main__][INFO] - Iteration 167 took 55s (37.66% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 40m 17s. Estimated total time: 15h 18m 21s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 50s, 500 more iterations: 7h 39m 10s.
[2025-08-20 10:48:35,263][__main__][INFO] - Starting iteration 167.
[2025-08-20 10:48:58,745][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:48:58,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:48:58,753][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:01,202][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:01,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:01,210][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:01,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:49:01,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:49:01,512–10:49:26,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries collapsed)
[2025-08-20 10:49:28,283][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:49:29,265][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:49:29,267][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:49:30,662][__main__][INFO] - Iteration 168 took 55s (37.99% Gen, 62.01% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 44m 18s. Estimated total time: 15h 23m 17s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 38s.
[2025-08-20 10:49:30,663][__main__][INFO] - Starting iteration 168.
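The ad_align task triple runs twice per iteration before the exchange. The log fixes the order of operations but not their internals, so the following is a control-flow sketch only; every method on trainer is a hypothetical name derived from the task strings.

```python
def advantage_alignment_phase(trainer, episode, logger):
    # Mirrors the twice-repeated task triple seen in each iteration's log;
    # all trainer methods here are assumed names, not the mllm API.
    for _ in range(2):
        alt = trainer.create_alternative_trajectory_batch(episode)
        adv = trainer.get_advantages_with_critic_grad_accumulation(alt)
        trainer.compute_alternative_advantage_estimates(alt, adv)
```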
[2025-08-20 10:49:53,870][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:53,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:53,878][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:56,328][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:56,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:56,336][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:49:56,338][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:49:56,339][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:49:56,638–10:50:21,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries collapsed)
[2025-08-20 10:50:23,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:50:24,381][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:50:24,383][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:50:25,901][__main__][INFO] - Iteration 169 took 55s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 40m 42s. Estimated total time: 15h 20m 37s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 3s, 500 more iterations: 7h 40m 18s.
[2025-08-20 10:50:25,903][__main__][INFO] - Starting iteration 169.
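Every entry shares the "[timestamp][logger name][LEVEL] - message" layout, which matches Python's logging module with its default asctime rendering. A configuration that would reproduce it (the project's actual logging setup is not shown in the log):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s][%(name)s][%(levelname)s] - %(message)s",
)
# The default asctime already renders as e.g. "2025-08-20 10:50:25,903".
logging.getLogger("mllm.training.trainer_common").info(
    "Accumulated the policy gradient loss for %d tokens.", 8960
)
```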
[2025-08-20 10:50:49,102][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:50:49,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:50:49,110][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:50:51,575][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:50:51,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:50:51,583][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:50:51,585][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:50:51,586][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:50:51,886–10:51:17,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries collapsed)
[2025-08-20 10:51:18,629][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:51:19,577][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:51:19,578][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:51:20,819][__main__][INFO] - Iteration 170 took 54s (37.76% Gen, 62.24% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 34m 26s. Estimated total time: 15h 15m 16s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 38s.
[2025-08-20 10:51:20,821][__main__][INFO] - Starting iteration 170.
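Throughput, for orientation: each update backpropagates 32 × 8,960 = 286,720 tokens, and the "Apply reinforce step" task that wraps the accumulation reports ΔTime around 26 to 29 s, so the training pass alone moves roughly 286,720 / 27 ≈ 10,600 tokens per second, with generation adding about another 20 s to each ~55 s iteration. (Arithmetic taken from the log above; it counts only the trained tokens the trainer reports.)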
[2025-08-20 10:51:43,978][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:51:43,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:51:43,986][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:51:46,426][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:51:46,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:51:46,434][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:51:46,436][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:51:46,437][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:51:46,735–10:52:11,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries collapsed)
[2025-08-20 10:52:13,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:52:14,389][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:52:14,391][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:52:15,859][__main__][INFO] - Iteration 171 took 55s (37.63% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 35m 32s. Estimated total time: 15h 17m 17s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 38s.
[2025-08-20 10:52:15,860][__main__][INFO] - Starting iteration 171.
[2025-08-20 10:52:39,042][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:52:39,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:52:39,049][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:52:41,481][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:52:41,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:52:41,489][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:52:41,491][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:52:41,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:52:41,790 … 10:53:06,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, identical apart from timestamps)
[2025-08-20 10:53:08,497][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:53:09,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:53:09,488][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:53:10,801][__main__][INFO] - Iteration 172 took 54s (37.74% Gen, 62.26% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 33m 0s. Estimated total time: 15h 15m 40s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 34s, 500 more iterations: 7h 37m 50s.
[2025-08-20 10:53:10,803][__main__][INFO] - Starting iteration 172.
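Each "Accumulated the policy gradient loss for 8960 tokens" record corresponds to one backward pass over a fixed-size micro-batch, and 32 of them are folded into every optimizer step. A sketch of that accumulation loop; the batch fields, the HF-style forward, and the token_logprobs helper are hypothetical stand-ins for the real mllm code:

import torch

def accumulate_policy_gradient(model, micro_batches, logger, accum_steps=32):
    # Assumes len(micro_batches) == accum_steps, as the 32 records per
    # iteration above suggest.
    for batch in micro_batches:
        logits = model(batch["input_ids"]).logits   # HF-style causal LM forward
        logp = token_logprobs(logits, batch["labels"])  # hypothetical helper
        # REINFORCE-style surrogate: advantage-weighted log-likelihood over
        # the action tokens selected by loss_mask.
        n_tokens = batch["loss_mask"].sum()
        loss = -(batch["advantages"] * logp * batch["loss_mask"]).sum() / n_tokens
        (loss / accum_steps).backward()  # scale so gradients average over the window
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    int(n_tokens))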
[2025-08-20 10:53:34,337][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:53:34,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:53:34,345][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:53:36,812][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:53:36,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:53:36,820][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:53:36,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:53:36,823][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:53:37,123 … 10:54:02,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, identical apart from timestamps)
[2025-08-20 10:54:03,868][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:54:04,853][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:54:04,855][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:54:06,155][__main__][INFO] - Iteration 173 took 55s (38.08% Gen, 61.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 38m 56s. Estimated total time: 15h 22m 31s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 15s.
[2025-08-20 10:54:06,156][__main__][INFO] - Starting iteration 173.
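"Apply reinforce step" consumes the accumulated gradients in one optimizer update. The near-zero ΔVRAM (2.288818359375e-05 GB, exactly 24 KiB) is consistent with the optimizer state having been allocated once early in the run and reused ever after. A sketch; gradient clipping is an assumption here, not something the log confirms:

import torch

def apply_reinforce_step(model, policy_optimizer, max_grad_norm=1.0):
    # max_grad_norm is an assumed hyperparameter.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    policy_optimizer.step()
    # Free gradient buffers so the next accumulation window starts clean.
    policy_optimizer.zero_grad(set_to_none=True)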
[2025-08-20 10:54:29,459][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:54:29,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:54:29,466][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:54:31,925][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:54:31,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:54:31,933][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:54:31,935][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:54:31,935][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:54:32,233 … 10:54:57,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, identical apart from timestamps)
[2025-08-20 10:54:58,898][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:54:59,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:54:59,850][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:55:01,164][__main__][INFO] - Iteration 174 took 55s (37.90% Gen, 62.10% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 32m 17s. Estimated total time: 15h 16m 47s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 23s.
[2025-08-20 10:55:01,165][__main__][INFO] - Starting iteration 174.
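Both optimizer states are checkpointed to the run directory after every step, which is what makes the run resumable mid-training. A minimal sketch matching the file names in the records above; the function name and logger are illustrative:

import os
import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir, logger):
    policy_path = os.path.join(out_dir, "policy_optimizer_state.pt")
    critic_path = os.path.join(out_dir, "critic_optimizer_state.pt")
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)

In the records above the two saves complete about a second after the reinforce step, so this per-iteration checkpointing costs only a small fraction of the 54-55 s iteration.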
[2025-08-20 10:55:24,379][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:55:24,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:55:24,387][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:55:26,857][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:55:26,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:55:26,864][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:55:26,867][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:55:26,867][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:55:27,166 … 10:55:52,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, identical apart from timestamps)
[2025-08-20 10:55:53,967][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:55:54,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:55:55,474][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:56:03,791][__main__][INFO] - Iteration 175 took 1m 2s (33.13% Gen, 66.87% Train). Generation: 20s, Training: 41s. Estimated remaining time: 14h 38m 12s. Estimated total time: 17h 23m 45s. Time estimates for 10 more iterations: 10m 26s, 100 more iterations: 1h 44m 22s, 500 more iterations: 8h 41m 52s.
[2025-08-20 10:56:03,793][__main__][INFO] - Starting iteration 175.
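The blip at iteration 175 (1m 2s instead of the usual ~54 s, pushing the total estimate from ~15h 15m to 17h 23m) shows the time estimates extrapolate from the most recent iteration alone: the quoted "10 more iterations: 10m 26s" is exactly 10 × 62.6 s. A sketch of that style of estimator; the function name and the iteration total (on the order of 1,000, as implied by the remaining-time figures) are assumptions:

import datetime

def eta_from_last_iteration(last_iter_seconds, completed, total_iterations=1000):
    # Extrapolate purely from the last iteration's duration, as the log's
    # jumpy estimates suggest; smoothing over recent iterations (e.g. an
    # exponential moving average) would make the ETA steadier.
    remaining = total_iterations - completed
    return datetime.timedelta(seconds=round(last_iter_seconds * remaining))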
[2025-08-20 10:56:26,869][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:56:26,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:56:26,877][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:56:29,351][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:56:29,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:56:29,359][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:56:29,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:56:29,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:56:29,661 … 10:56:54,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, identical apart from timestamps)
[2025-08-20 10:56:56,417][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:56:57,381][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:56:57,382][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:56:58,668][__main__][INFO] - Iteration 176 took 54s (37.56% Gen, 62.44% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 28m 7s. Estimated total time: 15h 14m 35s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 17s.
[2025-08-20 10:56:58,670][__main__][INFO] - Starting iteration 176.
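The "Sharing advantage alignment data" / "Receiving advantage packets" pair appears to mark the point where the trainers exchange per-trajectory advantages, the signal that advantage alignment uses to weight each side's policy gradient with the other side's advantages. The actual transport in mllm is not visible in the log; this sketch assumes an initialized torch.distributed process group, and all names are illustrative:

import torch.distributed as dist

def exchange_advantage_packets(local_advantages, logger):
    # Requires dist.init_process_group(...) to have been called earlier.
    logger.info("Sharing advantage alignment data.")
    packets = [None] * dist.get_world_size()
    # Symmetric share/receive: every rank contributes its advantages and
    # gets back the full list, including the other trainer's packet.
    dist.all_gather_object(packets, local_advantages)
    logger.info("Receiving advantage packets.")
    return packets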
[2025-08-20 10:57:21,808][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:57:21,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:57:21,815][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:57:24,275][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:57:24,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:57:24,282][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:57:24,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:57:24,285][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:57:24,583 … 10:57:49,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, identical apart from timestamps)
[2025-08-20 10:57:51,313][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:57:52,320][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:57:52,322][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:57:53,638][__main__][INFO] - Iteration 177 took 54s (37.66% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 28m 45s. Estimated total time: 15h 16m 8s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 4s.
[2025-08-20 10:57:53,640][__main__][INFO] - Starting iteration 177.
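For scale: each update accumulates 32 × 8,960 = 286,720 tokens. The accumulation window spans roughly 25 s (about 0.8 s per micro-batch), which is on the order of 11,000 tokens/s through the loss, and a full 54-55 s iteration works out to roughly 5,200-5,300 tokens/s end to end once generation, advantage computation, the optimizer step, and checkpointing are included.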
[2025-08-20 10:58:16,751][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:58:16,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:58:16,758][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:58:19,232][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:58:19,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:58:19,240][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:58:19,242][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:58:19,242][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:58:19,542 … 10:58:44,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, identical apart from timestamps)
[2025-08-20 10:58:46,249][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:58:47,234][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:58:47,235][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:58:48,575][__main__][INFO] - Iteration 178 took 54s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 27m 17s. Estimated total time: 15h 15m 35s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 47s.
[2025-08-20 10:58:48,577][__main__][INFO] - Starting iteration 178.
[2025-08-20 10:59:12,003][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:59:12,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:59:12,011][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:59:14,462][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:59:14,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:59:14,470][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 10:59:14,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 10:59:14,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 10:59:14,770 … 10:59:39,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, identical apart from timestamps)
[2025-08-20 10:59:41,537][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 10:59:42,539][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 10:59:42,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 10:59:44,391][__main__][INFO] - Iteration 179 took 55s (37.57% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 41m 0s. Estimated total time: 15h 30m 13s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 6s.
[2025-08-20 10:59:44,392][__main__][INFO] - Starting iteration 179.
[2025-08-20 11:00:07,524][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:00:07,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:00:07,531][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:00:09,988][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:00:09,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:00:09,995][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:00:09,997][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:00:09,998][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:00:10,296 … 11:00:35,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, identical apart from timestamps)
[2025-08-20 11:00:36,977][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:00:37,976][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:00:37,978][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:00:39,286][__main__][INFO] - Iteration 180 took 54s (37.68% Gen, 62.32% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 24m 44s. Estimated total time: 15h 14m 52s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 26s.
[2025-08-20 11:00:39,287][__main__][INFO] - Starting iteration 180.
[2025-08-20 11:01:02,424][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:02,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:02,431][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:04,903][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:04,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:04,910][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:04,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:01:04,913][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:01:05,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:06,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:06,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:07,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:08,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:09,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:09,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:10,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:11,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:12,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:13,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:13,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:14,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:15,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:16,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:17,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:01:17,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:18,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:19,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:20,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:21,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:21,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:22,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:23,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:24,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:25,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:26,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:27,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:27,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:28,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:29,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:30,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:01:31,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:01:32,936][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:01:32,937][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:01:34,251][__main__][INFO] - Iteration 181 took 54s (37.63% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 24m 59s. Estimated total time: 15h 16m 2s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 1s. [2025-08-20 11:01:34,252][__main__][INFO] - Starting iteration 181. 
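The timing estimates scale linearly from a running mean: at iteration 181 the 10-iteration figure of 9m 9s implies about 54.9 s per iteration, and the 15h 16m 2s total at that rate corresponds to a planned run of roughly 1,000 iterations. A sketch of that bookkeeping, assuming remaining time = (total − completed) × mean duration; `fmt` and `eta` are hypothetical helpers:

```python
def fmt(seconds: float) -> str:
    """Render seconds as the log's 'Xh Ym Zs' style."""
    s = int(round(seconds))
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

def eta(durations: list[float], total_iters: int) -> None:
    """Extrapolate remaining/total time from the mean iteration duration."""
    mean = sum(durations) / len(durations)
    done = len(durations)
    print(f"Estimated remaining time: {fmt(mean * (total_iters - done))}. "
          f"Estimated total time: {fmt(mean * total_iters)}.")
    for k in (10, 100, 500):
        print(f"Time estimates for {k} more iterations: {fmt(mean * k)}")
```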
[2025-08-20 11:01:57,474][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:57,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:57,481][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:59,960][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:59,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:59,967][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:01:59,970][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:01:59,970][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:02:00,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:01,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:01,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:02,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:03,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:04,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:05,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:05,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:06,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:07,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:08,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:08,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:09,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:10,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:11,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:12,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:02:12,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:13,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:15,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:15,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:16,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:17,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:18,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:19,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:19,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:20,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:21,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:22,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:23,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:23,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:24,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:28,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:02:30,322][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:30, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:02:31,248][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:02:31,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:02:36,217][__main__][INFO] - Iteration 182 took 1m 1s (33.51% Gen, 66.49% Train). Generation: 20s, Training: 41s. Estimated remaining time: 14h 20m 38s. Estimated total time: 17h 12m 44s. Time estimates for 10 more iterations: 10m 19s, 100 more iterations: 1h 43m 16s, 500 more iterations: 8h 36m 22s. [2025-08-20 11:02:36,218][__main__][INFO] - Starting iteration 182. 
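The recurring ΔVRAM Allocated of 2.288818359375e-05 GB on the reinforce step is exactly 24 KiB (2.288818359375e-05 × 1024³ = 24,576 bytes): allocation has reached steady state, which matches the flat 69.92% utilization figure. Lines of this shape could come from a wrapper like the following sketch around torch.cuda's memory counters; whether "Percentage of VRAM taken" is reserved/total or allocated/total is an assumption here:

```python
import time
from contextlib import contextmanager

import torch

GB = 1024 ** 3

@contextmanager
def profile_task(name: str, logger):
    """Log VRAM/time deltas around a task (sketch of the 'For task: ...' lines)."""
    alloc0 = torch.cuda.memory_allocated()
    resv0 = torch.cuda.memory_reserved()
    t0 = time.monotonic()
    yield
    dt = time.monotonic() - t0
    total = torch.cuda.get_device_properties(0).total_memory
    pct = 100.0 * torch.cuda.memory_reserved() / total  # assumption: reserved/total
    logger.info(
        "For task: %s , ΔVRAM Allocated: %s GB, ΔVRAM Reserved: %s GB, "
        "ΔTime: %s, Percentage of VRAM taken: %s%%,",
        name,
        (torch.cuda.memory_allocated() - alloc0) / GB,
        (torch.cuda.memory_reserved() - resv0) / GB,
        time.strftime("%H:%M:%S", time.gmtime(dt)),
        pct,
    )
```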
[2025-08-20 11:02:59,254][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:02:59,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:02:59,261][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:01,695][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:01,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:01,702][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:01,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:03:01,705][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:03:02,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:02,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:03,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:04,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:05,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:05,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:06,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:07,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:08,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:09,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:09,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:10,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:11,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:12,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:13,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:13,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:03:14,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:15,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:16,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:17,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:18,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:19,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:19,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:20,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:21,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:22,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:23,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:23,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:24,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:25,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:26,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:27,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:28,677][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:03:29,665][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:03:29,667][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:03:30,986][__main__][INFO] - Iteration 183 took 54s (37.60% Gen, 62.40% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 19m 47s. Estimated total time: 15h 12m 47s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 16s, 500 more iterations: 7h 36m 23s. [2025-08-20 11:03:30,988][__main__][INFO] - Starting iteration 183. 
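"Get advantages with critic gradient accumulation" and "Compute alternative advantage estimates" finish in well under a second with zero VRAM delta, consistent with operating on cached values rather than running fresh forward passes. The log does not show the estimator itself; below is a generic return-minus-baseline sketch with an accumulated critic loss, not the trainer_ad_align formula:

```python
import torch

def advantages_with_critic_grad_accumulation(values: torch.Tensor,
                                             rewards: torch.Tensor,
                                             gamma: float = 0.99) -> torch.Tensor:
    """Discounted returns minus a learned baseline, accumulating the critic's
    gradient as a side effect. `values` are critic outputs (with grad),
    `rewards` are per-step rewards; both shaped (batch, time)."""
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[:, 0])
    for t in range(rewards.shape[1] - 1, -1, -1):   # backward over the episode
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    # critic regression loss; .backward() accumulates into the critic's grads
    torch.nn.functional.mse_loss(values, returns).backward()
    return (returns - values).detach()              # advantages for the policy loss
```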
[2025-08-20 11:03:54,408][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:54,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:54,415][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:56,860][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:56,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:56,867][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:03:56,869][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:03:56,870][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:03:57,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:57,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:58,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:03:59,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:00,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:01,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:01,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:02,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:03,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:04,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:05,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:05,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:06,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:07,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:08,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:09,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:04:09,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:10,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:11,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:12,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:13,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:14,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:15,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:15,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:16,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:17,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:18,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:19,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:19,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:20,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:21,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:22,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:23,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:04:24,892][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:04:24,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:04:27,167][__main__][INFO] - Iteration 184 took 56s (37.34% Gen, 62.66% Train). Generation: 20s, Training: 35s. Estimated remaining time: 12h 42m 22s. Estimated total time: 15h 36m 18s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 37s, 500 more iterations: 7h 48m 9s. [2025-08-20 11:04:27,168][__main__][INFO] - Starting iteration 184. 
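Both optimizer states are rewritten to the same two .pt paths once per iteration, so a crash can at worst lose the current iteration. A sketch of such a save via torch.save of the state_dict; the write-then-rename step is an added safeguard against truncated files on an interrupted write, not something the log shows:

```python
import os

import torch

def save_optimizer_state(optimizer, path: str, logger, label: str) -> None:
    """Persist an optimizer's state_dict, replacing the previous checkpoint."""
    tmp = path + ".tmp"
    torch.save(optimizer.state_dict(), tmp)
    os.replace(tmp, path)  # atomic on POSIX; avoids half-written checkpoints
    logger.info("Saved %s optimizer state to %s", label, path)
```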
[2025-08-20 11:04:50,372][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:04:50,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:04:50,380][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:04:52,858][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:04:52,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:04:52,866][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:04:52,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:04:52,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:04:53,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:53,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:54,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:55,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:56,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:57,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:57,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:58,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:04:59,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:00,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:01,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:01,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:02,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:03,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:04,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:05,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:05:05,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:06,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:07,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:08,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:09,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:10,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:11,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:12,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:12,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:13,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:14,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:15,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:16,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:16,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:17,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:18,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:20,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:05:21,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:05:21,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:05:30,746][__main__][INFO] - Iteration 185 took 1m 3s (32.60% Gen, 67.40% Train). Generation: 20s, Training: 42s. Estimated remaining time: 14h 44m 37s. Estimated total time: 17h 39m 37s. Time estimates for 10 more iterations: 10m 35s, 100 more iterations: 1h 45m 57s, 500 more iterations: 8h 49m 48s. [2025-08-20 11:05:30,748][__main__][INFO] - Starting iteration 185. 
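"Sharing advantage alignment data" followed immediately by "Receiving advantage packets" suggests the two agents' trainers swap advantage tensors before the policy loss is formed. A minimal sketch of such an exchange over queues; the packet contents and the element-wise combination below are assumptions, not the documented trainer_ad_align protocol:

```python
import torch
import torch.multiprocessing as mp

def share_and_receive(send_q: mp.Queue, recv_q: mp.Queue,
                      own_advantages: torch.Tensor) -> torch.Tensor:
    """Exchange advantage tensors with the other trainer process (hypothetical)."""
    send_q.put(own_advantages.cpu())   # "Sharing advantage alignment data."
    other = recv_q.get()               # "Receiving advantage packets."
    # Assumed combination: weight our log-probs by an opponent-aware term.
    return own_advantages * other.to(own_advantages.device)
```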
[2025-08-20 11:05:54,013][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:05:54,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:05:54,020][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:05:56,489][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:05:56,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:05:56,497][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:05:56,500][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:05:56,500][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:05:56,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:57,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:58,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:59,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:05:59,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:00,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:01,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:02,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:03,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:03,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:04,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:05,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:06,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:07,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:07,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:08,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:06:09,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:10,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:11,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:11,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:12,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:13,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:14,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:15,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:15,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:16,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:17,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:18,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:19,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:20,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:21,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:21,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:23,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:06:24,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:06:24,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:06:25,740][__main__][INFO] - Iteration 186 took 54s (37.81% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 20m 37s. Estimated total time: 15h 16m 31s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 39s, 500 more iterations: 7h 38m 15s. [2025-08-20 11:06:25,742][__main__][INFO] - Starting iteration 186. 
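"Create alternative trajectory batch" reports zero VRAM and time delta, so it is plausibly a cheap tensor-splicing operation rather than new generation. One guess at its shape: clone the rollout batch and substitute alternative actions whose advantages can then be estimated. Everything below is hypothetical:

```python
import torch

def alternative_trajectory_batch(input_ids: torch.Tensor,
                                 action_slices: list[slice],
                                 alt_actions: list[torch.Tensor]) -> torch.Tensor:
    """Clone a rollout batch and splice alternative actions into each row
    (a guess at what 'Create alternative trajectory batch' does; the real
    construction in trainer_ad_align is not visible from the log)."""
    batch = input_ids.clone()
    for row, (sl, alt) in enumerate(zip(action_slices, alt_actions)):
        batch[row, sl] = alt
    return batch
```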
[2025-08-20 11:06:48,944][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:06:48,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:06:48,952][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:06:51,421][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:06:51,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:06:51,430][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:06:51,432][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:06:51,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:06:51,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:52,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:53,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:54,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:54,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:55,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:56,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:57,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:58,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:58,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:06:59,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:00,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:01,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:02,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:02,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:03,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:07:04,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:05,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:06,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:07,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:08,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:08,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:09,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:10,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:11,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:12,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:12,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:13,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:14,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:15,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:16,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:16,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:18,397][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:07:19,333][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:07:19,335][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:07:21,513][__main__][INFO] - Iteration 187 took 55s (37.22% Gen, 62.78% Train). Generation: 20s, Training: 35s. Estimated remaining time: 12h 32m 40s. Estimated total time: 15h 29m 31s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 45s. [2025-08-20 11:07:21,515][__main__][INFO] - Starting iteration 187. 
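The Gen/Train split is simple arithmetic on the two wall-time segments, and combined with 32 × 8960 tokens per update it gives the trainer's effective throughput: 286,720 tokens over a ~34 s training segment is roughly 8,400 tokens per second. A sketch (the printed percentages differ slightly from this because the log rounds the segment times to whole seconds):

```python
def split_report(gen_s: float, train_s: float,
                 tokens_per_iter: int = 32 * 8960) -> str:
    """Reproduce the 'X% Gen, Y% Train' split plus a derived throughput figure."""
    total = gen_s + train_s
    return (f"{100 * gen_s / total:.2f}% Gen, {100 * train_s / total:.2f}% Train; "
            f"~{tokens_per_iter / train_s:,.0f} tokens/s through the trainer")

# e.g. split_report(20, 34) -> '37.04% Gen, 62.96% Train; ~8,433 tokens/s ...'
```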
[2025-08-20 11:07:44,737][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:07:44,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:07:44,746][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:07:47,178][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:07:47,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:07:47,186][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:07:47,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:07:47,189][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:07:47,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:48,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:49,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:49,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:50,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:51,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:52,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:53,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:53,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:54,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:55,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:56,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:57,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:57,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:58,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:07:59,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:08:00,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:01,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:01,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:03,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:03,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:04,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:05,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:06,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:07,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:07,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:08,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:09,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:10,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:11,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:11,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:12,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:14,302][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:08:15,251][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:08:15,253][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:08:16,569][__main__][INFO] - Iteration 188 took 55s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 19m 48s. Estimated total time: 15h 17m 33s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 46s. [2025-08-20 11:08:16,571][__main__][INFO] - Starting iteration 188. 
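For post-hoc analysis of a run like this, the per-iteration summaries are easy to scrape. A small, hypothetical parser for the "Iteration N took ..." lines:

```python
import re

LINE = re.compile(
    r"Iteration (\d+) took (?:(\d+)m )?(\d+)s \((\d+\.\d+)% Gen, (\d+\.\d+)% Train\)"
)

def iteration_times(log_text: str) -> dict[int, int]:
    """Map iteration number -> wall seconds, parsed from the summary lines."""
    out = {}
    for m in LINE.finditer(log_text):
        it, mins, secs = int(m.group(1)), int(m.group(2) or 0), int(m.group(3))
        out[it] = 60 * mins + secs
    return out

# e.g. iteration_times("... Iteration 182 took 1m 1s (33.51% Gen, 66.49% Train) ...")
# -> {182: 61}
```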
[2025-08-20 11:08:40,610][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:08:40,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:08:40,617][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:08:43,085][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:08:43,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:08:43,092][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:08:43,094][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:08:43,095][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:08:43,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:44,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:44,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:45,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:46,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:47,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:48,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:48,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:49,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:50,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:51,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:52,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:52,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:54,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:55,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:56,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:08:56,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:57,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:58,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:08:59,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:00,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:01,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:02,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:02,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:03,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:04,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:05,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:06,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:06,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:07,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:08,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:09,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:10,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:09:11,684][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:09:11,687][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:09:13,006][__main__][INFO] - Iteration 189 took 56s (38.22% Gen, 61.78% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 41m 53s. Estimated total time: 15h 40m 35s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 3s, 500 more iterations: 7h 50m 17s. [2025-08-20 11:09:13,008][__main__][INFO] - Starting iteration 189. 
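Most iterations in this window take 54-56 s, but iterations 182 and 185 ran 61 s and 63 s, with the training segment stretching to 41-42 s. Building on the hypothetical `iteration_times` helper sketched above, a simple outlier check (the factor is chosen so that only those two are flagged here):

```python
from statistics import median

def slow_iterations(times: dict[int, int], factor: float = 1.1) -> list[int]:
    """Flag iterations noticeably slower than the median duration."""
    med = median(times.values())
    return sorted(it for it, t in times.items() if t > factor * med)

# For this window (median 55 s), flags iterations 182 (61 s) and 185 (63 s).
```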
[2025-08-20 11:09:36,484][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:09:36,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:09:36,492][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:09:38,953][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:09:38,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:09:38,961][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:09:38,964][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:09:38,964][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:09:39,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:40,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:40,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:41,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:42,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:43,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:44,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:44,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:45,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:46,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:47,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:48,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:48,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:49,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:50,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:51,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:09:51,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:52,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:53,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:54,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:55,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:55,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:56,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:58,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:58,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:09:59,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:00,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:01,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:02,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:02,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:03,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:04,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:06,012][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:10:06,948][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:10:06,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:10:08,500][__main__][INFO] - Iteration 190 took 55s (37.88% Gen, 62.11% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 25m 13s. Estimated total time: 15h 24m 51s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 29s, 500 more iterations: 7h 42m 25s. [2025-08-20 11:10:08,501][__main__][INFO] - Starting iteration 190. 
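Each "Accumulated the policy gradient loss for 8960 tokens" line corresponds to one backward pass over a fixed-size microbatch; 32 such passes precede every reinforce step, so each update covers roughly 32 × 8960 = 286,720 action tokens. A minimal sketch of token-weighted accumulation follows, assuming a REINFORCE-style objective of -advantage × log-prob per token; the function name and the chunked-tensor interface are illustrative, not the trainer's actual API.

    import torch

    def accumulate_policy_loss(logp_chunks, adv_chunks, logger=None):
        """Accumulate a REINFORCE loss over fixed-size token chunks (sketch)."""
        total_tokens = sum(chunk.numel() for chunk in logp_chunks)
        for logp, adv in zip(logp_chunks, adv_chunks):
            # Each backward() adds this chunk's gradient contribution;
            # dividing by the total token count makes the accumulated
            # gradient equal to that of the per-token average loss.
            loss = -(adv * logp).sum() / total_tokens
            loss.backward()
            if logger is not None:
                logger.info(
                    "Accumulated the policy gradient loss for %d tokens.",
                    logp.numel(),
                )

Splitting the batch this way keeps peak activation memory bounded by one 8960-token chunk rather than the full ~287k-token batch, which is consistent with the flat VRAM readings between updates.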
[2025-08-20 11:10:31,802][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:10:31,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:10:31,810][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:10:34,295][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:10:34,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:10:34,303][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:10:34,305][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:10:34,306][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:10:34,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:35,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:36,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:36,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:37,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:38,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:39,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:40,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:40,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:41,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:42,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:43,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:44,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:44,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:45,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:46,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:10:47,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:48,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:49,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:50,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:50,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:51,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:52,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:53,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:54,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:54,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:55,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:56,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:57,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:58,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:58,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:10:59,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:01,304][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:11:02,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:11:02,299][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:11:03,651][__main__][INFO] - Iteration 191 took 55s (37.79% Gen, 62.20% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 18m 36s. Estimated total time: 15h 19m 9s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 34s. [2025-08-20 11:11:03,652][__main__][INFO] - Starting iteration 191. 
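After the 32 accumulation passes, "Apply reinforce step" performs the single optimizer update for the iteration. The log confirms only that one step happens per accumulation window; the gradient clipping below, and its norm, are assumptions added for completeness. The near-zero ΔVRAM on this step (≈2.3e-05 GB) indicates the optimizer's state buffers were allocated in earlier iterations and are being reused.

    import torch

    def apply_reinforce_step(policy_params, optimizer, max_grad_norm: float = 1.0):
        """One update on the accumulated gradient (sketch; clipping assumed)."""
        # Clip the accumulated gradient, take one optimizer step, then clear
        # gradients so the next iteration's accumulation starts from zero.
        torch.nn.utils.clip_grad_norm_(policy_params, max_grad_norm)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)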
[2025-08-20 11:11:26,855][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:11:26,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:11:26,862][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:11:29,309][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:11:29,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:11:29,317][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:11:29,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:11:29,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:11:29,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:30,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:31,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:32,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:32,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:33,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:34,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:35,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:35,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:36,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:37,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:38,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:39,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:39,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:40,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:41,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:11:42,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:43,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:43,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:44,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:45,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:46,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:47,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:48,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:49,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:50,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:50,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:51,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:52,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:53,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:53,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:54,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:11:56,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:11:57,245][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:11:57,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:11:59,428][__main__][INFO] - Iteration 192 took 55s (37.22% Gen, 62.78% Train). Generation: 20s, Training: 35s. Estimated remaining time: 12h 28m 6s. Estimated total time: 15h 29m 35s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 47s. [2025-08-20 11:11:59,429][__main__][INFO] - Starting iteration 192. 
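The two "Saved ... optimizer state" lines show that both the policy and critic optimizer states are checkpointed every iteration, under the filenames visible in the log. A sketch of that step, assuming standard PyTorch serialization (the wrapper function name is illustrative):

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir: str):
        # Persisting optimizer state (e.g., Adam moment estimates) every
        # iteration lets a preempted job resume mid-run without resetting
        # its moments, at the cost of ~1-2 s of I/O per iteration here.
        torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
        torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

On resume, the matching torch.load(...) results would be passed to each optimizer's load_state_dict.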
[2025-08-20 11:12:22,808][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:12:22,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:12:22,815][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:12:25,311][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:12:25,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:12:25,319][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:12:25,322][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:12:25,322][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:12:25,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:26,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:27,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:28,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:28,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:29,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:30,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:31,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:31,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:32,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:33,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:34,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:35,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:35,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:36,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:37,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:12:38,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:39,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:39,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:40,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:42,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:42,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:43,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:44,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:45,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:46,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:46,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:47,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:48,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:49,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:49,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:50,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:12:52,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:12:53,649][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:12:54,580][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:12:55,924][__main__][INFO] - Iteration 193 took 56s (36.94% Gen, 63.06% Train). Generation: 20s, Training: 35s. Estimated remaining time: 12h 39m 8s. Estimated total time: 15h 41m 33s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 9s, 500 more iterations: 7h 50m 46s. [2025-08-20 11:12:57,102][__main__][INFO] - Starting iteration 193. 
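The task pair "Get advantages with critic gradient accumulation" / "Compute alternative advantage estimates" suggests advantages are computed for both the observed and an alternative trajectory batch, with the critic's value-loss gradients accumulated in the same pass. The exact estimator is not visible in the log; the sketch below uses a one-step TD residual with an assumed discount factor purely for illustration.

    import torch

    def td_advantages_with_critic_grads(values, rewards, gamma: float = 0.99):
        """One-step TD advantages; critic gradients accumulate as a side
        effect (schematic estimator; gamma is an assumption)."""
        # values: shape (T+1,), requires_grad; rewards: shape (T,)
        targets = rewards + gamma * values[1:].detach()
        adv = targets - values[:-1]
        (adv ** 2).mean().backward()  # accumulate the critic's value loss
        return adv.detach()

Running this once on the real batch and once on the alternative batch would produce the paired task lines seen at the top of each iteration.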
[2025-08-20 11:13:20,666][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:13:20,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:13:20,674][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:13:23,144][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:13:23,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:13:23,152][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:13:23,155][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:13:23,155][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:13:23,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:24,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:25,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:25,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:26,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:27,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:28,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:29,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:29,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:30,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:31,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:32,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:32,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:33,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:34,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:35,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:13:36,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:36,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:37,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:38,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:39,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:40,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:40,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:42,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:42,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:43,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:44,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:45,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:46,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:46,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:47,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:48,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:13:50,138][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:13:51,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:13:51,075][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:13:52,478][__main__][INFO] - Iteration 194 took 55s (38.10% Gen, 61.90% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 19m 32s. Estimated total time: 15h 22m 53s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 26s. [2025-08-20 11:13:52,479][__main__][INFO] - Starting iteration 194. 
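"Sharing advantage alignment data" followed by "Receiving advantage packets" indicates a symmetric exchange: each agent's trainer publishes its own advantage estimates and receives the co-player's, which advantage-alignment methods use to shape the opponent-aware policy-gradient term. The transport below (a pair of in-process queues) is a stand-in assumption; the real trainers could use torch.distributed, pipes, or shared memory.

    from queue import Queue

    import torch

    def exchange_advantage_packets(
        own_adv: torch.Tensor, send_q: Queue, recv_q: Queue
    ) -> torch.Tensor:
        # Publish this agent's advantages, then block until the co-player's
        # packet arrives; the returned tensor feeds the alignment weighting
        # applied to this agent's policy-gradient loss.
        send_q.put(own_adv.detach().cpu())
        return recv_q.get()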
[2025-08-20 11:14:16,181][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:14:16,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:14:16,189][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:14:18,649][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:14:18,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:14:18,657][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:14:18,659][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:14:18,660][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:14:18,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:19,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:20,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:21,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:22,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:22,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:23,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:24,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:25,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:26,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:26,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:27,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:28,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:29,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:30,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:30,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:14:31,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:32,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:33,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:34,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:34,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:35,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:36,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:37,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:38,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:39,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:40,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:40,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:41,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:42,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:43,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:44,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:14:45,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:14:46,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:14:46,491][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:14:47,850][__main__][INFO] - Iteration 195 took 55s (38.38% Gen, 61.62% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 18m 33s. Estimated total time: 15h 22m 49s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 16s, 500 more iterations: 7h 41m 24s. [2025-08-20 11:14:47,851][__main__][INFO] - Starting iteration 195. 
[2025-08-20 11:15:13,508][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:15:15,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:15:15,059][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:15:17,530][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:15:17,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:15:17,538][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:15:17,540][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:15:17,540][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:15:17,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:18,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:19,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:25,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:26,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:26,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:27,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:28,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:29,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:30,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:30,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:31,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:32,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:33,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:34,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:35,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:15:37,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:38,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:39,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:40,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:41,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:42,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:42,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:43,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:44,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:45,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:46,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:47,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:48,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:49,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:49,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:50,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:15:52,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:34, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:15:53,155][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:15:53,156][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:15:54,634][__main__][INFO] - Iteration 196 took 1m 6s (34.71% Gen, 65.28% Train). Generation: 23s, Training: 43s. Estimated remaining time: 15h 27m 39s. Estimated total time: 18h 33m 2s. Time estimates for 10 more iterations: 11m 7s, 100 more iterations: 1h 51m 18s, 500 more iterations: 9h 16m 31s. [2025-08-20 11:15:54,636][__main__][INFO] - Starting iteration 196. 
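Iteration 196 above is the outlier: 1m 6s instead of the usual ~55 s, including a roughly 6 s stall between the 11:15:19 and 11:15:25 accumulation entries. Because the forward-looking estimates extrapolate the most recent iteration's wall-clock time, that single slow iteration inflates every projection (500 more iterations jumps from ~7h 42m to 9h 16m) before they relax again once the next iteration returns to normal. A sketch of the arithmetic, which reproduces the logged figures to within sub-second rounding at ~66.78 s per iteration:

    def extrapolate_eta(seconds_per_iter: float, iterations: int) -> str:
        """Project wall-clock time for N more iterations (floor to seconds)."""
        total = int(seconds_per_iter * iterations)
        h, rem = divmod(total, 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

    # With iteration 196's ~66.78 s duration:
    # extrapolate_eta(66.782, 10)   -> '11m 7s'     (log: 11m 7s)
    # extrapolate_eta(66.782, 100)  -> '1h 51m 18s' (log: 1h 51m 18s)
    # extrapolate_eta(66.782, 500)  -> '9h 16m 31s' (log: 9h 16m 31s)

Whether the "Estimated remaining time" line uses exactly this latest-iteration extrapolation or a smoothed average is an assumption; the 10/100/500-iteration projections, at least, track the most recent duration.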
[2025-08-20 11:16:18,359][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:16:18,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:16:18,367][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:16:20,827][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:16:20,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:16:20,836][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:16:20,838][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:16:20,838][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:16:21,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:21,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:22,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:23,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:24,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:25,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:25,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:26,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:27,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:28,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:29,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:29,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:30,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:31,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:32,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:33,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:16:33,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:34,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:35,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:36,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:37,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:38,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:39,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:39,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:40,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:41,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:42,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:43,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:43,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:44,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:45,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:46,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:16:47,813][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:16:48,751][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:16:48,753][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:16:54,635][__main__][INFO] - Iteration 197 took 59s (35.44% Gen, 64.56% Train). Generation: 21s, Training: 38s. Estimated remaining time: 13h 33m 35s. Estimated total time: 16h 39m 59s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 59s, 500 more iterations: 8h 19m 59s. [2025-08-20 11:16:54,637][__main__][INFO] - Starting iteration 197. 
[2025-08-20 11:17:18,099][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:17:18,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:17:18,106][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:17:20,551][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:17:20,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:17:20,559][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 11:17:20,561][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 11:17:20,562][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 11:17:20,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:21,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:22,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:23,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:24,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:24,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:25,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:26,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:27,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:27,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:28,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:29,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:30,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:31,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:31,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:32,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 11:17:33,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:34,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:35,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:35,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:36,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:37,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:38,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:39,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:40,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:41,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:41,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:42,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:43,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:44,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:45,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:45,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 11:17:47,428][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 11:17:48,387][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 11:17:48,389][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 11:17:49,759][__main__][INFO] - Iteration 198 took 55s (38.13% Gen, 61.86% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 11m 23s. Estimated total time: 15h 18m 42s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 21s. [2025-08-20 11:17:49,761][__main__][INFO] - Starting iteration 198. 
[2025-08-20 11:18:13,437][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:18:13,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:18:13,445][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:18:15,906][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:18:15,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:18:15,914][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:18:15,916][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:18:15,916][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:18:16,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:17,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:17,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:18,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:19,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:20,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:20,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:21,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:22,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:23,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:24,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:24,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:25,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:26,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:27,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:28,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:28,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:29,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:30,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:31,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:32,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:33,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:34,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:34,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:35,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:36,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:37,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:38,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:38,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:39,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:40,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:41,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:18:42,901][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:18:43,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:18:43,850][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:18:45,488][__main__][INFO] - Iteration 199 took 55s (38.08% Gen, 61.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 20m 31s. Estimated total time: 15h 28m 46s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 23s.
[2025-08-20 11:18:45,489][__main__][INFO] - Starting iteration 199.
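Each training phase logs 32 "Accumulated the policy gradient loss for 8960 tokens" entries before a single "Apply reinforce step", i.e. roughly 32 × 8960 = 286,720 tokens feed one update. A sketch of such a gradient-accumulation loop; the policy and microbatch interfaces here are hypothetical stand-ins, not the real trainer_common code:

    import logging

    logger = logging.getLogger("mllm.training.trainer_common")

    def reinforce_update(policy, optimizer, microbatches):
        # Accumulate gradients over many small microbatches, then step once.
        optimizer.zero_grad(set_to_none=True)
        for mb in microbatches:  # 32 microbatches of 8960 tokens in this run
            # REINFORCE-style surrogate: advantage-weighted negative log-likelihood.
            logprobs = policy.token_log_probs(mb.tokens)  # hypothetical API
            loss = -(mb.advantages * logprobs).sum() / mb.num_tokens
            (loss / len(microbatches)).backward()  # scale so gradients average
            logger.info(f"Accumulated the policy gradient loss for {mb.num_tokens} tokens.")
        optimizer.step()  # the "Apply reinforce step" task

Accumulation keeps peak memory flat across microbatches, which is consistent with the near-zero ΔVRAM (2.288818359375e-05 GB, about 24 KB) reported for the reinforce step itself.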
[2025-08-20 11:19:08,850][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:19:08,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:19:08,857][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:19:11,312][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:19:11,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:19:11,319][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:19:11,321][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:19:11,322][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:19:11,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:12,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:13,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:13,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:14,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:15,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:16,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:17,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:17,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:18,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:19,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:20,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:21,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:21,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:22,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:23,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:24,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:25,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:25,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:26,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:27,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:28,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:29,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:30,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:31,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:31,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:32,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:33,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:34,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:35,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:35,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:36,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:19:38,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:19:39,237][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:19:39,239][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:19:40,726][__main__][INFO] - Iteration 200 took 55s (37.86% Gen, 62.14% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 11m 27s. Estimated total time: 15h 20m 36s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 3s, 500 more iterations: 7h 40m 18s.
[2025-08-20 11:19:40,728][__main__][INFO] - Starting iteration 200.
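The paired "Saved main/critic optimizer state" lines repeat every iteration and always target the same two files under seed_1000/agent_trainer/, so each save overwrites the last. A sketch of that checkpointing step, assuming the states are plain torch.save dumps of optimizer state dicts:

    import logging
    import os

    import torch

    logger = logging.getLogger("mllm.training.trainer_common")

    def save_optimizer_states(policy_optimizer, critic_optimizer, trainer_dir):
        # Overwrite the same two checkpoint files each iteration, as the log shows.
        policy_path = os.path.join(trainer_dir, "policy_optimizer_state.pt")
        critic_path = os.path.join(trainer_dir, "critic_optimizer_state.pt")
        torch.save(policy_optimizer.state_dict(), policy_path)
        logger.info(f"Saved main optimizer state to {policy_path}")
        torch.save(critic_optimizer.state_dict(), critic_path)
        logger.info(f"Saved critic optimizer state to {critic_path}")

Both saves together usually cost about a second here (e.g. 11:19:38,265 → 11:19:39,239), a small overhead against the ~55 s iterations.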
[2025-08-20 11:20:04,196][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:20:04,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:20:04,203][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:20:06,654][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:20:06,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:20:06,661][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:20:06,663][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:20:06,664][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:20:06,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:07,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:08,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:09,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:10,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:10,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:11,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:12,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:13,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:14,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:14,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:15,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:16,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:17,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:18,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:18,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:19,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:20,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:21,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:22,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:22,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:23,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:24,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:25,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:26,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:27,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:28,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:28,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:29,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:30,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:31,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:32,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:20:33,582][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:20:34,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:20:34,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:20:39,199][__main__][INFO] - Iteration 201 took 58s (35.94% Gen, 58.98% Train). Generation: 21s, Training: 34s. Estimated remaining time: 13h 4m 22s. Estimated total time: 16h 14m 30s. Time estimates for 10 more iterations: 9m 44s, 100 more iterations: 1h 37m 27s, 500 more iterations: 8h 7m 15s.
[2025-08-20 11:20:39,200][__main__][INFO] - Starting iteration 201.
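Iteration 201 shows how the projections behave: it took 58 s instead of the usual 55 s, its Gen/Train percentages sum to only 94.92% because roughly 3 s went to neither generation nor training, and the estimated total time jumps from about 15h 20m to 16h 14m. The summary arithmetic itself is just a seconds-per-iteration estimate multiplied by a remaining-iteration count; a sketch (whatever smoothing the trainer actually applies is not visible in the log):

    def format_duration(seconds):
        hours, rem = divmod(int(seconds), 3600)
        minutes, secs = divmod(rem, 60)
        return f"{hours}h {minutes}m {secs}s"

    def eta_summary(sec_per_iter, iters_done, total_iters):
        # Project remaining wall time from the current per-iteration estimate.
        remaining = total_iters - iters_done
        print("Estimated remaining time:", format_duration(sec_per_iter * remaining))
        for n in (10, 100, 500):
            print(f"Time estimate for {n} more iterations:", format_duration(sec_per_iter * n))

A per-iteration estimate of about 58.4 s reproduces iteration 201's "10 more iterations: 9m 44s" figure (584 s), so a single slow iteration visibly inflates every projection until the average recovers.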
[2025-08-20 11:21:02,507][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:21:02,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:21:02,515][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:21:04,993][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:21:04,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:21:05,001][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:21:05,003][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:21:05,004][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:21:05,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:06,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:06,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:07,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:08,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:09,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:10,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:10,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:11,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:12,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:13,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:14,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:14,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:15,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:16,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:17,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:18,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:18,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:20,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:20,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:25,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:25,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:26,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:27,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:28,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:29,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:29,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:30,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:31,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:32,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:33,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:33,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:21:35,516][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:30, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:21:41,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:21:41,708][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:21:43,206][__main__][INFO] - Iteration 202 took 1m 4s (32.55% Gen, 67.45% Train). Generation: 20s, Training: 43s. Estimated remaining time: 14h 35m 33s. Estimated total time: 17h 46m 45s. Time estimates for 10 more iterations: 10m 40s, 100 more iterations: 1h 46m 40s, 500 more iterations: 8h 53m 22s.
[2025-08-20 11:21:43,208][__main__][INFO] - Starting iteration 202.
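"Sharing advantage alignment data" / "Receiving advantage packets" mark the hand-off where the two agents' trainers swap advantage estimates: in advantage alignment each agent's policy-gradient weighting uses the co-player's advantages as well as its own, so neither trainer can accumulate its loss until the exchange completes. A minimal sketch over in-process queues; the real transport and packet format are not shown in this log:

    import logging
    from queue import Queue

    logger = logging.getLogger("mllm.training.trainer_ad_align")

    def exchange_advantages(own_advantages, send_queue: Queue, recv_queue: Queue):
        # Publish this agent's per-step advantages for the co-player's trainer...
        logger.info("Sharing advantage alignment data.")
        send_queue.put(own_advantages.detach().cpu())
        # ...then block until the co-player's advantages arrive.
        logger.info("Receiving advantage packets.")
        return recv_queue.get()

In the log the two lines sit about 1 ms apart, with the first accumulated microbatch following roughly 300 ms later, so the packets appear to be ready by the time each trainer reaches this point.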
[2025-08-20 11:22:06,635][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:22:06,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:22:06,642][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:22:09,094][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:22:09,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:22:09,102][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:22:09,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:22:09,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:22:09,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:10,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:10,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:11,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:12,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:13,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:14,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:14,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:15,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:16,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:17,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:18,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:18,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:19,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:20,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:21,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:22,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:22,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:23,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:24,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:25,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:26,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:27,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:28,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:29,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:29,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:30,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:31,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:32,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:32,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:33,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:34,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:22:36,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:22:37,130][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:22:37,132][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:22:38,479][__main__][INFO] - Iteration 203 took 55s (37.91% Gen, 62.09% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 9m 3s. Estimated total time: 15h 21m 10s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 7s, 500 more iterations: 7h 40m 35s.
[2025-08-20 11:22:38,480][__main__][INFO] - Starting iteration 203.
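The "Get advantages with critic gradient accumulation" task pairs critic training with advantage estimation over the sampled trajectories. The exact estimator is not visible in the log; for illustration, a generic GAE-style computation of per-step advantages from rewards and critic values:

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        # Generalized Advantage Estimation over one trajectory:
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated backwards.
        advantages = [0.0] * len(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            next_value = values[t + 1] if t + 1 < len(values) else 0.0
            delta = rewards[t] + gamma * next_value - values[t]
            running = delta + gamma * lam * running
            advantages[t] = running
        return advantages

Note that the whole advantage block ("Create alternative trajectory batch" through "Compute alternative advantage estimates") appears twice per iteration, about 2.5 s apart, suggesting two trajectory batches are processed back to back before the exchange.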
[2025-08-20 11:23:02,678][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:23:02,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:23:02,686][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:23:05,151][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:23:05,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:23:05,159][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:23:05,161][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:23:05,161][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:23:05,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:06,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:07,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:07,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:08,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:09,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:10,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:11,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:11,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:12,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:13,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:14,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:14,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:15,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:16,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:17,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:18,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:18,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:19,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:20,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:21,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:22,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:22,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:23,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:24,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:25,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:26,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:26,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:27,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:28,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:29,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:30,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:23:32,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:23:33,040][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:23:33,042][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:23:34,407][__main__][INFO] - Iteration 204 took 55s (38.86% Gen, 61.14% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 18m 58s. Estimated total time: 15h 32m 1s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 0s.
[2025-08-20 11:23:34,408][__main__][INFO] - Starting iteration 204.
[2025-08-20 11:23:57,807][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:23:57,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:23:57,815][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:00,292][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:00,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:00,300][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:00,302][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:24:00,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:24:00,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:01,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:02,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:02,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:03,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:04,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:05,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:06,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:06,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:07,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:08,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:09,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:10,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:10,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:11,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:12,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:13,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:14,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:14,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:15,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:16,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:17,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:18,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:19,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:20,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:20,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:21,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:22,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:23,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:24,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:24,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:25,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:27,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:24:28,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:24:28,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:24:29,617][__main__][INFO] - Iteration 205 took 55s (37.91% Gen, 62.09% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 6m 10s. Estimated total time: 15h 20m 8s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 4s.
[2025-08-20 11:24:29,619][__main__][INFO] - Starting iteration 205.
[2025-08-20 11:24:53,662][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:53,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:53,669][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:56,133][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:56,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:56,140][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:24:56,143][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:24:56,143][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:24:56,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:57,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:58,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:58,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:24:59,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:00,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:01,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:01,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:02,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:03,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:04,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:05,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:05,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:06,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:07,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:08,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:09,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:09,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:10,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:11,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:12,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:13,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:14,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:15,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:15,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:16,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:17,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:18,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:19,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:19,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:20,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:21,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:23,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:25:24,184][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:25:24,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:25:25,514][__main__][INFO] - Iteration 206 took 55s (38.66% Gen, 61.34% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 16m 40s. Estimated total time: 15h 31m 35s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 47s.
[2025-08-20 11:25:25,516][__main__][INFO] - Starting iteration 206.
[2025-08-20 11:25:49,679][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:25:49,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:25:49,687][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:25:52,125][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:25:52,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:25:52,133][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:25:52,134][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:25:52,135][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:25:52,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:53,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:54,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:54,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:55,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:56,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:57,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:57,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:58,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:25:59,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:00,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:01,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:01,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:02,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:03,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:04,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:05,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:05,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:06,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:07,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:08,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:09,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:09,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:10,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:11,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:12,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:13,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:14,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:15,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:15,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:16,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:17,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:26:19,122][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:26:20,196][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:26:20,198][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:26:21,630][__main__][INFO] - Iteration 207 took 56s (38.70% Gen, 61.30% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 19m 23s. Estimated total time: 15h 35m 13s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 31s, 500 more iterations: 7h 47m 36s.
[2025-08-20 11:26:21,631][__main__][INFO] - Starting iteration 207.
[2025-08-20 11:26:44,878][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:26:44,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:26:44,885][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:26:47,336][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:26:47,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:26:47,344][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:26:47,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:26:47,346][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:26:47,650 … 11:27:12,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries over this interval)
[2025-08-20 11:27:14,389][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:27:15,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:27:15,409][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:27:16,762][__main__][INFO] - Iteration 208 took 55s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12h 2m 4s. Estimated total time: 15h 18m 49s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 24s.
[2025-08-20 11:27:16,763][__main__][INFO] - Starting iteration 208.
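The "For task: …" lines report per-task deltas of allocated and reserved CUDA memory plus wall time; the recurring 2.288818359375e-05 GB on "Apply reinforce step" is exactly 24 KiB, i.e. a single tiny persistent allocation. A plausible reconstruction of such a profiler from the standard torch.cuda counters (log_task is a hypothetical name, not the project's actual helper):

```python
import time
from contextlib import contextmanager

import torch

# Hypothetical reconstruction of the "For task: ..." profiling lines,
# built from standard torch.cuda memory counters.
@contextmanager
def log_task(name, device=0):
    alloc0 = torch.cuda.memory_allocated(device)
    resv0 = torch.cuda.memory_reserved(device)
    t0 = time.monotonic()
    yield
    elapsed = time.gmtime(time.monotonic() - t0)
    total = torch.cuda.get_device_properties(device).total_memory
    print(f"For task: {name} , "
          f"ΔVRAM Allocated: {(torch.cuda.memory_allocated(device) - alloc0) / 2**30} GB, "
          f"ΔVRAM Reserved: {(torch.cuda.memory_reserved(device) - resv0) / 2**30} GB, "
          f"ΔTime: {time.strftime('%H:%M:%S', elapsed)}, "
          f"Percentage of VRAM taken: {100 * torch.cuda.memory_reserved(device) / total}%,")

# Usage: with log_task("Apply reinforce step"): optimizer.step()
```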
[2025-08-20 11:27:40,370][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:27:40,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:27:40,378][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:27:42,834][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:27:42,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:27:42,842][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:27:42,844][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:27:42,845][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:27:43,142 … 11:28:08,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries over this interval)
[2025-08-20 11:28:09,824][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:28:10,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:28:10,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:28:12,108][__main__][INFO] - Iteration 209 took 55s (38.23% Gen, 61.77% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 4m 43s. Estimated total time: 15h 22m 24s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 12s.
[2025-08-20 11:28:12,110][__main__][INFO] - Starting iteration 209.
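The projections in each summary follow from the measured per-iteration time by linear extrapolation. A worked version of the arithmetic, using roughly 55.3 s as the running average behind iteration 209's figures (assumed, since the log prints only rounded values):

```python
# Worked example of the extrapolation arithmetic in the summary lines,
# assuming simple linear scaling of the average per-iteration time.
def fmt(seconds: float) -> str:
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

iter_seconds = 55.3  # assumed running average around iteration 209
for n in (10, 100, 500):
    print(f"{n} more iterations: {fmt(n * iter_seconds)}")
# -> 9m 13s, 1h 32m 10s, 7h 40m 50s; the logged 1h 32m 14s / 7h 41m 12s
#    imply an average closer to 55.34 s, consistent with rounding.
```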
[2025-08-20 11:28:35,394][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:28:35,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:28:35,402][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:28:37,877][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:28:37,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:28:37,885][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:28:37,887][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:28:37,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:28:38,187 … 11:29:03,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries over this interval)
[2025-08-20 11:29:04,903][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:29:05,834][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:29:05,836][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:29:07,180][__main__][INFO] - Iteration 210 took 55s (37.79% Gen, 62.21% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 59m 13s. Estimated total time: 15h 17m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 54s.
[2025-08-20 11:29:07,181][__main__][INFO] - Starting iteration 210.
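Both optimizers are persisted at the end of every iteration, which is what makes the run resumable mid-training. A minimal sketch of that checkpointing with the standard PyTorch state-dict API (save_optimizer_states is a hypothetical name):

```python
import torch

# Minimal sketch of the end-of-iteration checkpointing suggested by the
# "Saved main/critic optimizer state to ..." lines.
def save_optimizer_states(policy_opt, critic_opt, out_dir):
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

# On resume, the matching restore would be:
#   policy_opt.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
#   critic_opt.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))
```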
[2025-08-20 11:29:31,061][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:29:31,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:29:31,068][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:29:33,513][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:29:33,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:29:33,521][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:29:33,523][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:29:33,523][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:29:33,823 … 11:29:59,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries over this interval)
[2025-08-20 11:30:00,627][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:30:01,571][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:30:01,572][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:30:03,243][__main__][INFO] - Iteration 211 took 56s (38.21% Gen, 61.79% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 14m 49s. Estimated total time: 15h 34m 21s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 26s, 500 more iterations: 7h 47m 10s.
[2025-08-20 11:30:03,245][__main__][INFO] - Starting iteration 211.
[2025-08-20 11:30:26,527][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:30:26,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:30:26,536][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:30:29,000][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:30:29,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:30:29,008][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:30:29,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:30:29,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:30:29,313 … 11:30:54,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries over this interval)
[2025-08-20 11:30:56,119][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:30:57,084][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:30:57,086][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:30:58,452][__main__][INFO] - Iteration 212 took 55s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 59m 39s. Estimated total time: 15h 20m 6s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 3s.
[2025-08-20 11:30:58,453][__main__][INFO] - Starting iteration 212.
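The tasks "Get advantages with critic gradient accumulation" and "Compute alternative advantage estimates" name what is computed but not how. Purely as an illustration, here is a one-step TD advantage with a squared-error critic loss; the actual estimator used by trainer_ad_align may well differ (e.g. GAE):

```python
import torch

# Illustration only: a plain one-step TD advantage plus a critic loss
# whose backward() accumulates gradients into the critic.
def td_advantages(rewards: torch.Tensor, values: torch.Tensor, gamma: float = 0.99):
    # rewards: (T,); values: (T + 1,) critic estimates V(s_0), ..., V(s_T)
    targets = rewards + gamma * values[1:].detach()  # target detached, as is conventional
    advantages = targets - values[:-1]
    critic_loss = advantages.pow(2).mean()
    return advantages.detach(), critic_loss

values = torch.randn(17, requires_grad=True)  # toy critic outputs
adv, critic_loss = td_advantages(torch.randn(16), values)
critic_loss.backward()  # gradient accumulation into the (toy) critic
```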
[2025-08-20 11:31:22,454][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:31:22,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:31:22,462][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:31:24,911][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:31:24,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:31:24,919][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:31:24,921][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:31:24,921][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:31:25,220 … 11:31:50,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries over this interval)
[2025-08-20 11:31:52,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:31:52,972][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:31:52,974][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:31:54,434][__main__][INFO] - Iteration 213 took 55s (38.52% Gen, 61.48% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 11m 37s. Estimated total time: 15h 33m 0s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 30s.
[2025-08-20 11:31:54,436][__main__][INFO] - Starting iteration 213.
[2025-08-20 11:32:20,463][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:32:20,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:32:20,471][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:32:22,917][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:32:22,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:32:22,924][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:32:22,927][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:32:22,927][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:32:23,228 … 11:32:53,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries over this interval)
[2025-08-20 11:32:54,695][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:31, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:32:55,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:32:55,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:32:57,045][__main__][INFO] - Iteration 214 took 1m 2s (37.69% Gen, 62.31% Train). Generation: 23s, Training: 39s. Estimated remaining time: 14h 1m 2s. Estimated total time: 17h 23m 28s. Time estimates for 10 more iterations: 10m 26s, 100 more iterations: 1h 44m 20s, 500 more iterations: 8h 41m 44s.
[2025-08-20 11:32:57,046][__main__][INFO] - Starting iteration 214.
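"Sharing advantage alignment data" followed by "Receiving advantage packets" suggests that the per-player trainers exchange advantage tensors so each policy update can be shaped by the other side's advantages. The transport is not visible in the log; the in-process queue below is a hypothetical stand-in for whatever channel is actually used:

```python
import torch
from queue import Queue

# Hypothetical stand-in for the advantage exchange between two trainers;
# the real transport (shared memory, RPC, files, ...) is not shown in the log.
to_opponent: Queue = Queue()
from_opponent: Queue = Queue()

def share_advantages(advantages: torch.Tensor) -> None:
    to_opponent.put({"advantages": advantages.detach().cpu()})

def receive_advantage_packet() -> torch.Tensor:
    # Blocks until the opposing trainer's packet arrives, after which
    # the policy-gradient accumulation can begin.
    return from_opponent.get()["advantages"]
```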
[2025-08-20 11:33:20,311][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:33:20,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:33:20,319][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:33:22,795][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:33:22,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:33:22,802][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:33:22,804][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:33:22,805][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:33:23,103 … 11:33:48,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries over this interval)
[2025-08-20 11:33:49,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:33:50,753][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:33:50,755][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:33:52,143][__main__][INFO] - Iteration 215 took 55s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 54m 54s. Estimated total time: 15h 18m 16s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 8s.
[2025-08-20 11:33:52,144][__main__][INFO] - Starting iteration 215.
[2025-08-20 11:34:15,420][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:34:15,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:34:15,427][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:34:17,879][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:34:17,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:34:17,886][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:34:17,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:34:17,889][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:34:18,187 … 11:34:49,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries over this interval)
[2025-08-20 11:34:50,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:32, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:34:51,669][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:34:51,671][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:34:53,368][__main__][INFO] - Iteration 216 took 1m 1s (34.00% Gen, 66.00% Train). Generation: 20s, Training: 40s. Estimated remaining time: 13h 36m 0s. Estimated total time: 17h 0m 23s. Time estimates for 10 more iterations: 10m 12s, 100 more iterations: 1h 42m 2s, 500 more iterations: 8h 30m 11s.
[2025-08-20 11:34:53,369][__main__][INFO] - Starting iteration 216.
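Each iteration splits into a generation phase and a training phase, and the summary line reports their shares of the wall time. A sketch of how such a summary can be timed and printed (hypothetical driver structure; the real loop lives in __main__):

```python
import time

# Hypothetical sketch of the per-iteration timing behind the
# "Iteration N took Xs (G% Gen, T% Train)" summary lines.
def run_iteration(generate, train) -> None:
    t0 = time.monotonic(); generate(); t_gen = time.monotonic() - t0
    t1 = time.monotonic(); train(); t_train = time.monotonic() - t1
    total = t_gen + t_train
    print(f"Iteration took {total:.0f}s "
          f"({100 * t_gen / total:.2f}% Gen, {100 * t_train / total:.2f}% Train). "
          f"Generation: {t_gen:.0f}s, Training: {t_train:.0f}s.")

run_iteration(lambda: time.sleep(0.2), lambda: time.sleep(0.4))
# -> roughly "Iteration took 1s (33.33% Gen, 66.67% Train). ..."
```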
[2025-08-20 11:35:18,824][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:35:18,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:35:18,832][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:35:21,291][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:35:21,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:35:21,298][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:35:21,301][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:35:21,301][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:35:21,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:22,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:23,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:23,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:24,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:25,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:26,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:27,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:27,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:28,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:29,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:30,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:31,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:31,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:32,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:33,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:34,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:35,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:35,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:36,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:38,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:38,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:39,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:40,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:41,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:41,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:42,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:43,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:44,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:45,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:45,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:46,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:35:48,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:35:49,249][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:35:49,251][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:35:52,073][__main__][INFO] - Iteration 217 took 58s (39.17% Gen, 60.83% Train). Generation: 22s, Training: 35s. Estimated remaining time: 12h 53m 2s. Estimated total time: 16h 18m 23s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 50s, 500 more iterations: 8h 9m 11s.
[2025-08-20 11:35:52,075][__main__][INFO] - Starting iteration 217.
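Each iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." records before a single "Apply reinforce step", so one optimizer update is spread over 32 micro-batches of 8,960 tokens each, about 287k tokens in total. A minimal PyTorch sketch of that accumulation pattern, assuming a model that returns per-token log-probabilities (an illustration, not the mllm trainer itself):

import torch

def reinforce_step(model, optimizer, micro_batches):
    # Accumulate the policy-gradient loss over fixed-size token
    # micro-batches, then apply a single optimizer step.
    optimizer.zero_grad(set_to_none=True)
    total_tokens = sum(mb["advantages"].numel() for mb in micro_batches)
    for mb in micro_batches:
        logprobs = model(mb["input_ids"])  # assumed per-token log pi(a|s)
        loss = -(logprobs * mb["advantages"]).sum() / total_tokens
        loss.backward()  # gradients add up across micro-batches
    optimizer.step()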
[2025-08-20 11:36:15,583][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:36:15,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:36:15,590][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:36:18,048][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:36:18,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:36:18,056][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:36:18,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:36:18,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:36:18,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:19,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:19,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:20,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:21,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:22,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:23,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:23,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:24,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:25,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:26,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:27,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:27,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:28,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:29,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:30,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:31,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:31,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:32,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:33,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:34,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:35,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:36,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:37,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:37,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:38,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:39,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:40,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:41,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:41,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:42,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:43,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:36:45,094][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:36:46,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:36:46,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:36:47,396][__main__][INFO] - Iteration 218 took 55s (38.01% Gen, 61.98% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 55m 44s. Estimated total time: 15h 22m 1s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 12s, 500 more iterations: 7h 41m 0s.
[2025-08-20 11:36:47,398][__main__][INFO] - Starting iteration 218.
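The recurring "For task: ... , ΔVRAM Allocated ..., ΔTime ..., Percentage of VRAM taken ..." records suggest each pipeline stage runs inside a profiling scope that diffs CUDA memory counters and wall-clock time. A plausible reconstruction built on standard torch.cuda calls; the real mllm logger may be structured differently:

import time
from contextlib import contextmanager
import torch

@contextmanager
def track_task(logger, task: str):
    # Snapshot CUDA memory and time, run the task, then log the deltas.
    alloc0 = torch.cuda.memory_allocated()
    resv0 = torch.cuda.memory_reserved()
    t0 = time.monotonic()
    yield
    d_alloc = (torch.cuda.memory_allocated() - alloc0) / 2**30
    d_resv = (torch.cuda.memory_reserved() - resv0) / 2**30
    dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
    free, total = torch.cuda.mem_get_info()
    pct = 100.0 * (total - free) / total
    logger.info(f"For task: {task} , ΔVRAM Allocated: {d_alloc} GB, "
                f"ΔVRAM Reserved: {d_resv} GB, ΔTime: {dt}, "
                f"Percentage of VRAM taken: {pct}%,")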
[2025-08-20 11:37:11,061][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:37:11,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:37:11,068][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:37:13,532][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:37:13,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:37:13,540][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:37:13,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:37:13,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:37:13,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:14,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:15,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:16,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:17,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:17,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:18,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:19,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:20,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:20,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:21,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:22,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:23,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:24,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:24,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:25,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:26,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:27,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:28,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:29,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:30,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:30,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:31,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:32,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:33,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:34,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:34,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:35,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:36,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:37,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:38,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:38,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:37:40,473][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:37:41,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:37:41,431][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:37:42,779][__main__][INFO] - Iteration 219 took 55s (38.29% Gen, 61.70% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 55m 48s. Estimated total time: 15h 23m 0s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 30s.
[2025-08-20 11:37:42,781][__main__][INFO] - Starting iteration 219.
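The paired "Saved main/critic optimizer state" records show both optimizers being checkpointed to .pt files after every update, which keeps Adam-style statistics intact if the job is preempted and resumed. A minimal sketch with torch.save (helper name hypothetical; file names as in the log):

import os
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir: str):
    # Persist both optimizer states so a restarted run resumes
    # with its momentum/second-moment statistics.
    os.makedirs(out_dir, exist_ok=True)
    torch.save(policy_opt.state_dict(),
               os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(),
               os.path.join(out_dir, "critic_optimizer_state.pt"))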
[2025-08-20 11:38:06,115][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:38:06,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:38:06,122][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:38:08,596][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:38:08,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:38:08,604][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:38:08,606][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:38:08,606][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:38:08,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:09,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:10,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:11,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:12,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:12,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:13,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:14,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:15,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:16,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:16,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:17,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:18,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:19,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:20,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:20,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:21,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:22,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:23,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:24,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:25,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:26,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:26,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:27,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:28,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:29,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:30,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:30,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:31,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:32,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:33,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:34,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:38:35,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:38:36,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:38:36,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:38:37,969][__main__][INFO] - Iteration 220 took 55s (37.83% Gen, 62.17% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 51m 40s. Estimated total time: 15h 19m 47s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 53s.
[2025-08-20 11:38:37,970][__main__][INFO] - Starting iteration 220.
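"Sharing advantage alignment data." immediately followed by "Receiving advantage packets." points to a symmetric exchange: each agent's trainer ships its own advantage estimates to its opponent and blocks until the opponent's packet arrives. The transport mllm actually uses is not visible in this log; a toy in-process stand-in with queues:

import queue
import torch

def exchange_advantages(own: torch.Tensor,
                        outbox: "queue.Queue[torch.Tensor]",
                        inbox: "queue.Queue[torch.Tensor]") -> torch.Tensor:
    # Send our advantages to the opponent's trainer, then wait
    # for the opponent's advantage packet.
    outbox.put(own.detach().cpu())
    return inbox.get()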
[2025-08-20 11:39:01,309][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:01,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:01,317][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:03,776][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:03,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:03,783][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:03,786][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:39:03,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:39:04,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:04,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:05,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:06,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:07,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:08,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:08,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:09,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:10,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:11,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:12,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:12,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:13,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:14,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:15,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:16,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:16,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:17,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:18,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:19,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:19,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:20,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:21,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:22,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:23,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:23,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:24,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:25,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:26,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:27,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:28,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:29,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:39:30,777][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:39:31,875][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:39:31,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:39:33,313][__main__][INFO] - Iteration 221 took 55s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 53m 19s. Estimated total time: 15h 22m 22s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 11s.
[2025-08-20 11:39:33,314][__main__][INFO] - Starting iteration 221.
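"Get advantages with critic gradient accumulation" implies advantages are computed from a learned critic while the critic's own loss is being accumulated. The log does not say which estimator is used; generalized advantage estimation (GAE) is a common choice and is sketched below under that assumption:

import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    # A_t = delta_t + gamma*lam*A_{t+1},
    # delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), terminal V = 0.
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv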
[2025-08-20 11:39:56,492][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:56,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:56,500][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:58,954][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:58,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:58,962][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:39:58,965][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:39:58,966][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:39:59,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:00,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:00,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:01,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:02,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:03,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:04,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:04,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:05,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:06,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:07,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:08,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:08,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:09,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:10,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:11,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:11,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:12,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:13,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:14,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:15,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:15,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:17,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:18,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:18,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:19,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:20,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:21,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:22,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:22,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:23,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:24,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:25,959][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:40:26,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:40:26,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:40:28,283][__main__][INFO] - Iteration 222 took 54s (37.73% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 46m 11s. Estimated total time: 15h 16m 8s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 4s.
[2025-08-20 11:40:28,285][__main__][INFO] - Starting iteration 222.
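Once the opponent's packets arrive, each trainer can combine its own advantages with the opponent's to form the aligned policy-gradient loss. The exact objective in this codebase is not shown in the log; the sketch below is one hedged reading of an advantage-alignment surrogate, where a discounted running sum of the agent's own advantages is multiplied with the opponent's per-step advantages to shape the REINFORCE weighting:

import torch

def ad_align_surrogate(logprobs: torch.Tensor,
                       own_adv: torch.Tensor,
                       opp_adv: torch.Tensor,
                       beta: float = 1.0,
                       gamma: float = 0.99) -> torch.Tensor:
    # Discounted running sum of our own past advantages.
    T = own_adv.shape[0]
    past = torch.zeros(T)
    acc = 0.0
    for t in range(T):
        acc = gamma * acc + own_adv[t]
        past[t] = acc
    # Alignment term rewards actions where the opponent's advantage
    # moves with our accumulated advantage.
    weight = own_adv + beta * past * opp_adv
    return -(logprobs * weight).mean()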
[2025-08-20 11:40:52,456][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:40:52,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:40:52,464][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:40:54,945][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:40:54,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:40:54,953][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:40:54,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:40:54,956][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:40:55,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:56,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:56,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:57,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:58,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:40:59,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:00,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:00,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:01,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:02,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:03,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:03,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:04,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:05,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:06,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:07,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:07,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:08,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:09,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:10,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:11,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:12,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:13,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:14,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:14,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:15,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:16,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:17,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:17,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:18,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:19,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:20,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:21,932][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:41:22,858][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:41:22,860][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:41:24,182][__main__][INFO] - Iteration 223 took 55s (38.82% Gen, 61.18% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 0m 43s. Estimated total time: 15h 31m 36s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 48s.
[2025-08-20 11:41:24,184][__main__][INFO] - Starting iteration 223.
[2025-08-20 11:41:47,823][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:41:47,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:41:47,831][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:41:50,288][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:41:50,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:41:50,296][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:41:50,299][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:41:50,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:41:50,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:51,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:52,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:52,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:53,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:54,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:55,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:56,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:56,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:57,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:58,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:41:59,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:00,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:00,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:01,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:02,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:03,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:04,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:05,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:06,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:06,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:07,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:08,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:09,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:10,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:10,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:11,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:12,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:13,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:14,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:14,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:15,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:17,200][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:42:18,195][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:42:18,196][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:42:19,517][__main__][INFO] - Iteration 224 took 55s (38.32% Gen, 61.68% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 50m 24s. Estimated total time: 15h 22m 13s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 6s.
[2025-08-20 11:42:19,519][__main__][INFO] - Starting iteration 224.
[2025-08-20 11:42:42,825][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:42:42,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:42:42,832][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:42:45,321][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:42:45,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:42:45,329][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:42:45,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:42:45,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:42:45,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:46,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:47,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:48,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:48,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:49,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:50,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:51,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:51,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:52,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:53,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:54,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:55,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:55,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:56,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:57,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:58,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:59,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:42:59,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:00,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:01,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:02,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:03,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:04,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:05,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:05,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:06,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:07,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:08,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:09,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:09,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:10,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:43:12,299][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:43:13,285][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:43:13,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:43:14,585][__main__][INFO] - Iteration 225 took 55s (37.82% Gen, 62.18% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 45m 2s. Estimated total time: 15h 17m 46s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 53s.
[2025-08-20 11:43:14,587][__main__][INFO] - Starting iteration 225.
[2025-08-20 11:43:37,944][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:43:37,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:43:37,957][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:43:40,407][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:43:40,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:43:40,415][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:43:40,417][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:43:40,418][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:43:40,716 .. 11:44:05,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32 times)
[2025-08-20 11:44:07,418][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:44:08,380][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:44:08,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:44:09,847][__main__][INFO] - Iteration 226 took 55s (37.85% Gen, 62.15% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 47m 21s. Estimated total time: 15h 20m 59s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 5s, 500 more iterations: 7h 40m 29s.
[2025-08-20 11:44:09,849][__main__][INFO] - Starting iteration 226.
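The "For task: …" entries all follow one template: per-task deltas of allocated and reserved VRAM, elapsed wall-clock time, and the fraction of device memory in use. One plausible way to emit them is a context manager wrapped around each task; the sketch below assumes torch.cuda counters and is not the codebase's actual implementation:

    import contextlib
    import logging
    import time

    import torch

    log = logging.getLogger("mllm.training.trainer_ad_align")

    @contextlib.contextmanager
    def track(task: str):
        # Snapshot VRAM counters and wall clock before the task runs.
        a0 = torch.cuda.memory_allocated()
        r0 = torch.cuda.memory_reserved()
        t0 = time.time()
        yield
        da = (torch.cuda.memory_allocated() - a0) / 2**30  # GB
        dr = (torch.cuda.memory_reserved() - r0) / 2**30
        dt = time.strftime("%H:%M:%S", time.gmtime(time.time() - t0))
        total = torch.cuda.get_device_properties(0).total_memory
        pct = 100 * torch.cuda.memory_reserved() / total
        log.info(f"For task: {task}, ΔVRAM Allocated: {da} GB, "
                 f"ΔVRAM Reserved: {dr} GB, ΔTime: {dt}, "
                 f"Percentage of VRAM taken: {pct}%")

Used as "with track('Create alternative trajectory batch'): ...", this reproduces the log format above; zero deltas then simply mean the task reused already-allocated memory.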
[2025-08-20 11:44:33,274][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:44:33,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:44:33,282][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:44:35,780][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:44:35,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:44:35,788][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:44:35,791][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:44:35,791][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:44:36,089 .. 11:45:01,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32 times)
[2025-08-20 11:45:02,687][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:45:03,732][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:45:03,735][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:45:05,216][__main__][INFO] - Iteration 227 took 55s (37.80% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 48m 12s. Estimated total time: 15h 22m 47s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 16s, 500 more iterations: 7h 41m 23s.
[2025-08-20 11:45:05,218][__main__][INFO] - Starting iteration 227.
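The timing summaries are plain arithmetic on a running mean of iteration durations. Around iteration 227 that mean is about 55.3 s, which reproduces the logged projections (10 × 55.3 s ≈ 9m 13s; 500 × 55.3 s ≈ 7h 41m). A sketch of the calculation, with the averaging window as an assumption:

    from datetime import timedelta

    def project(durations, iters):
        # Remaining time = mean past duration x iterations still to run.
        mean = sum(durations) / len(durations)
        return timedelta(seconds=round(mean * iters))

    # With the ~55.3 s running mean seen around iteration 227:
    print(project([55.3], 10))   # 0:09:13
    print(project([55.3], 500))  # ~7:40:50, close to the logged 7h 41m 23s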
[2025-08-20 11:45:28,478][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:45:28,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:45:28,486][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:45:30,946][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:45:30,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:45:30,954][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:45:30,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:45:30,957][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:45:31,257 .. 11:45:56,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32 times)
[2025-08-20 11:45:57,928][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:45:58,860][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:45:58,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:46:00,245][__main__][INFO] - Iteration 228 took 55s (37.79% Gen, 62.21% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 41m 37s. Estimated total time: 15h 17m 6s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 42s, 500 more iterations: 7h 38m 33s.
[2025-08-20 11:46:00,247][__main__][INFO] - Starting iteration 228.
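Both optimizer states are rewritten to fixed paths under seed_1000/agent_trainer every iteration, so a preempted job can resume from the last completed update. A minimal equivalent, assuming plain torch.save of the state dicts (the wrapper name is hypothetical):

    import os

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir):
        # Overwritten each iteration; only the latest update survives.
        os.makedirs(out_dir, exist_ok=True)
        torch.save(policy_opt.state_dict(),
                   os.path.join(out_dir, "policy_optimizer_state.pt"))
        torch.save(critic_opt.state_dict(),
                   os.path.join(out_dir, "critic_optimizer_state.pt"))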
[2025-08-20 11:46:23,840][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:46:23,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:46:23,848][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:46:26,309][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:46:26,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:46:26,317][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:46:26,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:46:26,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:46:26,619 .. 11:46:51,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32 times)
[2025-08-20 11:46:53,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:46:54,295][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:46:54,297][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:46:55,668][__main__][INFO] - Iteration 229 took 55s (38.13% Gen, 61.87% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 47m 15s. Estimated total time: 15h 23m 40s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 50s.
[2025-08-20 11:46:55,669][__main__][INFO] - Starting iteration 229.
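"Sharing advantage alignment data" followed by "Receiving advantage packets" suggests each trainer publishes its own advantage estimates and then consumes its opponent's before building the aligned policy-gradient loss. The actual transport is not visible in this log; the in-process exchange below is purely a toy stand-in to show the handshake:

    import queue

    class AdvantageExchange:
        # Hypothetical stand-in for the advantage-packet exchange
        # between the two trainers (agent ids 0 and 1).
        def __init__(self):
            self.inbox = {0: queue.Queue(), 1: queue.Queue()}

        def share(self, sender: int, advantages) -> None:
            # "Sharing advantage alignment data."
            self.inbox[1 - sender].put(advantages)

        def receive(self, receiver: int):
            # "Receiving advantage packets."
            return self.inbox[receiver].get()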
[2025-08-20 11:47:18,885][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:47:18,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:47:18,892][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:47:21,351][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:47:21,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:47:21,359][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:47:21,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:47:21,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:47:21,660 .. 11:47:46,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32 times)
[2025-08-20 11:47:48,295][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:47:49,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:47:49,255][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:47:50,588][__main__][INFO] - Iteration 230 took 54s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 37m 59s. Estimated total time: 15h 15m 18s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 39s.
[2025-08-20 11:47:50,590][__main__][INFO] - Starting iteration 230.
[2025-08-20 11:48:13,894][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:48:13,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:48:13,901][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:48:16,388][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:48:16,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:48:16,396][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:48:16,398][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:48:16,399][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:48:16,705 .. 11:48:41,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32 times)
[2025-08-20 11:48:43,474][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:48:45,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:48:45,805][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:48:47,081][__main__][INFO] - Iteration 231 took 56s (36.90% Gen, 63.10% Train). Generation: 20s, Training: 35s. Estimated remaining time: 12h 3m 15s. Estimated total time: 15h 41m 31s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 9s, 500 more iterations: 7h 50m 45s.
[2025-08-20 11:48:48,778][__main__][INFO] - Starting iteration 231.
[2025-08-20 11:49:12,214][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:49:12,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:49:12,221][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:49:14,684][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:49:14,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:49:14,692][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:49:14,694][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:49:14,695][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:49:14,998 .. 11:49:42,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32 times)
[2025-08-20 11:49:43,601][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:49:44,544][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:49:44,546][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:49:45,944][__main__][INFO] - Iteration 232 took 57s (36.69% Gen, 63.30% Train). Generation: 20s, Training: 36s. Estimated remaining time: 12h 13m 20s. Estimated total time: 15h 52m 35s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 15s, 500 more iterations: 7h 56m 17s.
[2025-08-20 11:49:45,947][__main__][INFO] - Starting iteration 232.
[2025-08-20 11:50:09,231][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:50:09,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:50:09,239][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:50:11,733][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:50:11,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:50:11,741][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:50:11,743][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:50:11,744][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:50:12,043 .. 11:50:37,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32 times)
[2025-08-20 11:50:38,660][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:50:39,709][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:50:39,711][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:50:41,172][__main__][INFO] - Iteration 233 took 55s (37.65% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 40m 12s. Estimated total time: 15h 20m 22s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 11s.
[2025-08-20 11:50:41,174][__main__][INFO] - Starting iteration 233.
[2025-08-20 11:51:04,524][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:51:04,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:51:04,532][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:51:06,988][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:51:06,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:51:06,995][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 11:51:06,997][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:51:06,998][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:51:07,297 .. 11:51:32,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (repeated 32 times)
[2025-08-20 11:51:33,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 11:51:34,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:51:34,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:51:36,298][__main__][INFO] - Iteration 234 took 55s (37.91% Gen, 62.09% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 37m 38s. Estimated total time: 15h 18m 44s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 22s.
[2025-08-20 11:51:36,313][__main__][INFO] - Starting iteration 234.
[2025-08-20 11:51:59,916][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:51:59,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:51:59,924][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:02,377][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:02,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:02,384][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:02,387][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:52:02,387][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:52:02,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:03,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:04,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:05,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:05,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:06,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:07,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:08,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:09,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:09,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:10,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:11,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:12,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:12,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:13,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:14,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:15,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:16,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:16,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:17,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:18,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:19,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:20,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:21,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:22,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:22,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:23,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:24,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:25,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:26,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:26,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:27,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:29,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:52:30,248][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:52:30,250][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:52:31,605][__main__][INFO] - Iteration 235 took 55s (38.23% Gen, 61.77% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 39m 30s. Estimated total time: 15h 21m 30s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 9s, 500 more iterations: 7h 40m 45s.
[2025-08-20 11:52:31,607][__main__][INFO] - Starting iteration 235.
[2025-08-20 11:52:55,004][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:55,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:55,012][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:57,474][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:57,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:57,482][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:52:57,484][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:52:57,484][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:52:57,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:58,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:52:59,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:00,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:00,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:01,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:02,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:03,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:04,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:04,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:05,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:06,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:07,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:08,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:08,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:09,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:10,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:11,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:12,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:12,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:13,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:14,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:15,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:16,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:16,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:18,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:18,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:19,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:20,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:21,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:22,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:22,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:24,444][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:53:25,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:53:25,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:53:26,788][__main__][INFO] - Iteration 236 took 55s (37.95% Gen, 62.05% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 36m 45s. Estimated total time: 15h 19m 40s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 50s.
[2025-08-20 11:53:26,790][__main__][INFO] - Starting iteration 236.
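The paired "Sharing advantage alignment data." / "Receiving advantage packets." records suggest a symmetric exchange between the two agents' trainers: each sends its own per-turn advantage estimates and receives its opponent's before forming the advantage-alignment loss. The actual transport is not visible in the log; below is one plausible sketch using torch.distributed point-to-point calls, with AdvantagePacket, exchange_advantages, and peer_rank all being assumed names:

from dataclasses import dataclass

import torch
import torch.distributed as dist

@dataclass
class AdvantagePacket:
    # Per-turn advantage estimates for one agent's trajectories,
    # shape (episodes, turns); both trainers use matching shapes.
    advantages: torch.Tensor

def exchange_advantages(own: AdvantagePacket, peer_rank: int) -> AdvantagePacket:
    # Order the blocking send/recv by rank so the two trainers
    # cannot deadlock waiting on each other.
    received = torch.empty_like(own.advantages)
    if dist.get_rank() < peer_rank:
        dist.send(own.advantages, dst=peer_rank)
        dist.recv(received, src=peer_rank)
    else:
        dist.recv(received, src=peer_rank)
        dist.send(own.advantages, dst=peer_rank)
    return AdvantagePacket(received)

The sub-millisecond gap between the two records above is consistent with an exchange of this size: a few thousand scalar advantages per iteration is a tiny message.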
[2025-08-20 11:53:50,180][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:53:50,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:53:50,188][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:53:52,642][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:53:52,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:53:52,650][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:53:52,652][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:53:52,652][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:53:52,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:53,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:54,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:55,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:56,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:56,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:57,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:58,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:53:59,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:00,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:00,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:01,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:02,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:03,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:04,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:04,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:05,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:06,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:07,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:08,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:08,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:09,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:10,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:11,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:12,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:13,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:14,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:15,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:15,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:16,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:17,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:18,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:19,924][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:54:20,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:54:20,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:54:22,496][__main__][INFO] - Iteration 237 took 55s (37.56% Gen, 62.44% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 44m 33s. Estimated total time: 15h 28m 25s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 50s, 500 more iterations: 7h 44m 12s.
[2025-08-20 11:54:22,497][__main__][INFO] - Starting iteration 237.
[2025-08-20 11:54:45,815][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:54:45,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:54:45,822][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:54:48,277][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:54:48,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:54:48,284][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:54:48,286][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:54:48,287][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:54:48,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:49,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:50,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:50,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:51,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:52,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:53,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:54,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:54,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:55,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:56,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:57,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:58,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:58,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:54:59,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:00,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:01,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:02,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:03,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:04,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:04,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:05,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:06,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:07,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:08,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:08,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:09,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:10,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:11,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:12,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:12,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:13,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:15,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:55:16,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:55:16,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:55:17,610][__main__][INFO] - Iteration 238 took 55s (37.83% Gen, 62.17% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 33m 46s. Estimated total time: 15h 18m 32s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 51s, 500 more iterations: 7h 39m 16s.
[2025-08-20 11:55:17,612][__main__][INFO] - Starting iteration 238.
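Every optimizer step in this run is preceded by exactly 32 "Accumulated the policy gradient loss for 8960 tokens." records: the iteration's batch is split into fixed-size token chunks, each chunk's gradient is accumulated, and a single update is applied at the end. A sketch of that pattern for a causal LM follows; the chunk fields (input_ids, action_mask, advantages) and the helper name are assumptions, not the mllm API:

import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def accumulate_policy_gradient(model, chunks, optimizer):
    # One REINFORCE update accumulated over fixed-size token chunks.
    optimizer.zero_grad(set_to_none=True)
    num_chunks = len(chunks)
    for chunk in chunks:
        logits = model(chunk["input_ids"]).logits[:, :-1]   # predict token t+1 from t
        targets = chunk["input_ids"][:, 1:]
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        mask = chunk["action_mask"][:, 1:]                  # 1 only on generated tokens
        adv = chunk["advantages"][:, 1:]
        # REINFORCE surrogate: -A * log pi(a|s), averaged over response tokens.
        loss = -(adv * token_logp * mask).sum() / mask.sum()
        # Dividing by num_chunks makes the accumulated gradient the
        # mean over chunks rather than the sum.
        (loss / num_chunks).backward()
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    int(chunk["input_ids"].numel()))
    optimizer.step()

At roughly 0.8 s per 8960-token chunk in the records above, the 32 chunks account for most of the ~26 s "Apply reinforce step" time.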
[2025-08-20 11:55:40,970][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:55:40,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:55:40,977][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:55:43,447][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:55:43,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:55:43,454][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:55:43,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:55:43,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:55:43,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:44,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:45,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:46,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:46,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:47,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:48,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:49,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:50,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:50,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:51,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:52,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:53,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:54,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:54,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:55,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:56,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:57,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:58,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:58,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:55:59,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:00,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:01,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:02,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:03,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:04,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:04,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:05,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:06,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:07,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:08,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:08,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:10,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:56:11,341][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:56:11,343][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:56:12,815][__main__][INFO] - Iteration 239 took 55s (37.85% Gen, 62.15% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 34m 21s. Estimated total time: 15h 20m 2s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 1s.
[2025-08-20 11:56:12,817][__main__][INFO] - Starting iteration 239.
[2025-08-20 11:56:37,694][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:56:37,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:56:37,701][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:56:40,181][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:56:40,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:56:40,189][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:56:40,191][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:56:40,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:56:40,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:41,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:42,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:42,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:43,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:44,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:45,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:46,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:46,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:47,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:48,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:49,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:50,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:50,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:51,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:52,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:53,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:53,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:54,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:55,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:56,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:57,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:57,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:56:59,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:00,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:00,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:01,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:02,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:03,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:04,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:04,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:05,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:07,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:57:08,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:57:08,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:57:09,604][__main__][INFO] - Iteration 240 took 56s (39.43% Gen, 60.57% Train). Generation: 22s, Training: 34s. Estimated remaining time: 11h 59m 48s. Estimated total time: 15h 46m 26s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 38s, 500 more iterations: 7h 53m 13s.
[2025-08-20 11:57:09,605][__main__][INFO] - Starting iteration 240.
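The two "Saved … optimizer state" records that follow each reinforce step persist both optimizers to the run directory, which is what makes a preempted job resumable without losing the Adam moment estimates. The paths match the log; the function wrapping the two torch.save calls is an illustrative sketch:

import os

import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, trainer_dir):
    # Persist optimizer state (e.g. Adam moments and step counts) after
    # every update so a restart resumes the same optimization trajectory.
    policy_path = os.path.join(trainer_dir, "policy_optimizer_state.pt")
    critic_path = os.path.join(trainer_dir, "critic_optimizer_state.pt")
    torch.save(policy_optimizer.state_dict(), policy_path)
    torch.save(critic_optimizer.state_dict(), critic_path)
    return policy_path, critic_path

The ~1 ms gap between the main and critic records above is consistent with the critic optimizer state being tiny (adapter parameters only) relative to the policy's.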
[2025-08-20 11:57:32,983][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:57:32,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:57:32,991][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:57:35,461][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:57:35,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:57:35,469][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:57:35,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:57:35,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:57:35,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:36,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:37,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:38,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:38,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:39,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:40,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:41,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:42,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:42,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:43,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:44,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:45,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:46,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:46,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:47,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:48,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:49,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:50,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:50,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:51,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:52,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:53,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:54,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:55,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:56,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:56,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:57,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:58,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:57:59,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:00,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:00,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:02,428][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:58:03,526][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:58:03,557][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:58:04,939][__main__][INFO] - Iteration 241 took 55s (37.77% Gen, 62.22% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 34m 39s. Estimated total time: 15h 22m 13s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 6s.
[2025-08-20 11:58:04,941][__main__][INFO] - Starting iteration 241.
[2025-08-20 11:58:28,280][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:58:28,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:58:28,288][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:58:30,744][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:58:30,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:58:30,752][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:58:30,755][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:58:30,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:58:31,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:31,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:32,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:33,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:34,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:35,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:35,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:36,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:37,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:38,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:38,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:39,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:40,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:41,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:42,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:42,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:43,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:44,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:45,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:46,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:46,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:47,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:48,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:49,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:50,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:51,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:52,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:52,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:53,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:54,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:55,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:56,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:58:57,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:58:58,658][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:58:58,659][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:59:00,121][__main__][INFO] - Iteration 242 took 55s (37.83% Gen, 62.17% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 31m 10s. Estimated total time: 15h 19m 39s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 57s, 500 more iterations: 7h 39m 49s.
[2025-08-20 11:59:00,123][__main__][INFO] - Starting iteration 242.
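Each iteration opens with two passes of "Create alternative trajectory batch" → "Get advantages with critic gradient accumulation" → "Compute alternative advantage estimates": the logged trajectories are re-scored under counterfactual (alternative) continuations so the advantage-alignment term can relate each agent's actions to the opponent's outcomes. The exact construction is not recoverable from the log; a one-step TD sketch under that reading, with every field name assumed:

import torch

@torch.no_grad()
def alternative_advantage_estimates(critic, batch, gamma=0.99):
    # Re-score the alternative batch with the critic and form one-step
    # TD advantages: A_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t).
    values = critic(batch["obs"]).squeeze(-1)
    next_values = critic(batch["next_obs"]).squeeze(-1)
    not_done = 1.0 - batch["done"].float()
    return batch["rewards"] + gamma * next_values * not_done - values

The 0.0 GB deltas and sub-millisecond ΔTime on these records suggest the heavy forward passes are timed elsewhere and these tasks mostly reorganize tensors already on the GPU.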
[2025-08-20 11:59:23,342][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:59:23,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:59:23,350][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:59:25,817][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:59:25,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:59:25,825][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 11:59:25,827][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 11:59:25,828][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 11:59:26,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:26,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:27,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:28,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:29,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:30,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:30,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:31,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:32,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:33,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:34,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:34,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:35,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:36,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:37,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:38,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:38,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:39,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:40,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:41,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:42,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:43,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:44,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:44,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:45,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:46,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:47,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:48,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:48,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:49,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:50,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:51,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 11:59:52,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 11:59:53,924][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 11:59:53,927][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 11:59:55,659][__main__][INFO] - Iteration 243 took 55s (37.41% Gen, 62.59% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 36m 11s. Estimated total time: 15h 25m 36s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 48s.
[2025-08-20 11:59:55,661][__main__][INFO] - Starting iteration 243.
[2025-08-20 12:00:19,582][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:00:19,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:00:19,590][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:00:22,045][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:00:22,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:00:22,053][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:00:22,055][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:00:22,056][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:00:22,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:23,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:23,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:24,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:25,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:26,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:27,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:27,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:28,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:29,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:30,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:31,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:31,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:32,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:33,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:34,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:35,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:35,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:36,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:37,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:38,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:39,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:39,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:40,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:41,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:42,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:43,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:43,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:45,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:45,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:46,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:47,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:00:49,004][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:00:49,955][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:00:49,957][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:00:51,298][__main__][INFO] - Iteration 244 took 55s (38.58% Gen, 61.42% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 36m 56s. Estimated total time: 15h 27m 16s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 38s.
[2025-08-20 12:00:51,299][__main__][INFO] - Starting iteration 244.
[2025-08-20 12:01:14,611][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:01:14,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:01:14,618][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:01:17,076][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:01:17,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:01:17,084][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:01:17,086][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:01:17,087][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:01:17,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:18,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:18,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:19,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:20,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:21,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:22,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:22,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:23,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:24,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:25,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:26,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:26,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:27,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:28,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:29,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:30,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:30,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:31,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:32,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:33,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:34,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:34,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:36,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:36,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:37,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:38,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:39,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:40,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:40,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:41,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:42,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:01:44,023][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:01:44,961][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:01:44,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:01:46,353][__main__][INFO] - Iteration 245 took 55s (37.87% Gen, 62.12% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 26m 18s. Estimated total time: 15h 17m 34s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 47s.
[2025-08-20 12:01:46,355][__main__][INFO] - Starting iteration 245.
[2025-08-20 12:02:09,649][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:02:09,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:02:09,657][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:02:12,109][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:02:12,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:02:12,117][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:02:12,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:02:12,119][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:02:12,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:13,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:14,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:14,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:15,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:16,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:17,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:17,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:18,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:19,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:20,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:21,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:21,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:22,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:23,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:24,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:25,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:25,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:26,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:27,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:28,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:29,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:30,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:31,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:31,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:32,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:33,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:34,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:35,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:35,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:36,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:37,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:02:39,091][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:02:40,046][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:02:40,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:02:41,623][__main__][INFO] - Iteration 246 took 55s (37.67% Gen, 62.33% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 28m 57s. Estimated total time: 15h 21m 7s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 33s.
[2025-08-20 12:02:41,625][__main__][INFO] - Starting iteration 246.
[2025-08-20 12:03:05,375][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:03:05,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:03:05,382][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:03:07,833][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:03:07,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:03:07,841][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:03:07,844][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:03:07,844][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:03:08,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:08,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:09,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:10,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:11,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:12,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:12,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:13,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:14,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:15,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:16,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:16,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:17,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:18,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:19,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:20,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:20,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:21,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:22,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:23,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:24,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:24,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:25,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:26,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:27,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:28,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:29,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:30,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:30,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:31,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:32,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:33,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:03:34,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:03:35,774][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:03:35,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:03:37,539][__main__][INFO] - Iteration 247 took 55s (38.07% Gen, 61.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 38m 47s. Estimated total time: 15h 31m 53s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 11s, 500 more iterations: 7h 45m 56s.
[2025-08-20 12:03:37,540][__main__][INFO] - Starting iteration 247.
[2025-08-20 12:04:00,763][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:00,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:00,771][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:03,212][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:03,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:03,219][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:03,222][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:04:03,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:04:03,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:04,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:05,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:05,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:06,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:07,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:08,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:09,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:09,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:10,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:11,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:12,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:13,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:13,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:14,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:15,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:16,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:17,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:18,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:19,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:19,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:20,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:21,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:22,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:23,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:23,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:24,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:25,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:26,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:26,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:27,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:28,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:30,150][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:04:31,094][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:04:31,095][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:04:33,117][__main__][INFO] - Iteration 248 took 55s (37.41% Gen, 62.59% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 32m 14s. Estimated total time: 15h 26m 16s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 37s, 500 more iterations: 7h 43m 8s.
[2025-08-20 12:04:33,119][__main__][INFO] - Starting iteration 248.
[2025-08-20 12:04:56,355][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:56,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:56,363][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:58,839][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:58,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:58,847][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:04:58,849][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:04:58,850][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:04:59,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:04:59,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:00,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:01,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:02,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:03,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:03,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:04,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:05,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:06,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:07,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:07,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:08,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:09,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:10,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:11,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:11,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:12,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:13,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:14,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:15,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:16,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:17,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:17,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:18,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:19,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:20,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:21,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:21,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:22,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:23,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:24,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:25,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:05:26,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:05:26,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:05:28,156][__main__][INFO] - Iteration 249 took 55s (37.75% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 22m 19s. Estimated total time: 15h 17m 16s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 38s.
[2025-08-20 12:05:28,158][__main__][INFO] - Starting iteration 249.
[2025-08-20 12:05:51,810][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:05:51,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:05:51,817][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:05:54,250][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:05:54,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:05:54,258][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:05:54,260][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:05:54,261][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:05:54,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:55,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:56,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:56,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:57,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:58,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:05:59,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:00,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:00,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:01,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:02,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:03,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:04,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:04,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:05,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:06,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:07,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:08,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:09,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:10,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:10,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:11,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:12,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:13,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:14,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:14,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:15,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:16,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:17,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:18,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:18,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:19,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:21,173][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:06:22,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:06:22,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:06:23,526][__main__][INFO] - Iteration 250 took 55s (38.34% Gen, 61.66% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 26m 54s. Estimated total time: 15h 22m 47s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 16s, 500 more iterations: 7h 41m 23s.
[2025-08-20 12:06:23,527][__main__][INFO] - Starting iteration 250.
[2025-08-20 12:06:47,168][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:06:47,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:06:47,176][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:06:49,648][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:06:49,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:06:49,655][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:06:49,658][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:06:49,658][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:06:49,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:50,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:51,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:52,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:53,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:53,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:54,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:55,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:56,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:57,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:57,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:58,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:06:59,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:00,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:01,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:01,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:02,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:03,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:04,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:05,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:05,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:07,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:07,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:08,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:09,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:10,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:11,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:11,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:12,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:13,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:14,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:15,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:16,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:07:17,606][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:07:17,607][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:07:21,609][__main__][INFO] - Iteration 251 took 58s (36.48% Gen, 59.01% Train). Generation: 21s, Training: 34s. Estimated remaining time: 12h 11m 9s. Estimated total time: 16h 8m 0s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 48s, 500 more iterations: 8h 4m 0s.
[2025-08-20 12:07:21,610][__main__][INFO] - Starting iteration 251.
[2025-08-20 12:07:44,844][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:07:44,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:07:44,851][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:07:47,347][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:07:47,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:07:47,355][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:07:47,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:07:47,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:07:47,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:48,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:49,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:50,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:50,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:51,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:52,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:53,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:54,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:54,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:55,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:56,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:57,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:57,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:58,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:07:59,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:00,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:01,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:01,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:02,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:03,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:04,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:05,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:05,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:07,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:08,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:08,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:09,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:10,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:11,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:11,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:12,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 12:08:14,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:08:15,285][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:08:15,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:08:16,618][__main__][INFO] - Iteration 252 took 55s (37.74% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 19m 2s. Estimated total time: 15h 16m 47s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 23s.
[2025-08-20 12:08:16,620][__main__][INFO] - Starting iteration 252.
[2025-08-20 12:08:39,805][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:08:39,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:08:39,813][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:08:42,268][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:08:42,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:08:42,276][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:08:42,278][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:08:42,279][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:08:42,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:43,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:44,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:44,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:45,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:46,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:47,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:48,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:48,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:49,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:50,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:51,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:52,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:52,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:53,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:54,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:08:55,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:56,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:56,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:57,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:58,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:08:59,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:00,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:01,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:02,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:02,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:03,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:04,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:05,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:05,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:06,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:07,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:09,131][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:09:10,075][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:09:10,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:09:11,595][__main__][INFO] - Iteration 253 took 54s (37.74% Gen, 62.26% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 17m 34s. Estimated total time: 15h 16m 15s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 7s. [2025-08-20 12:09:11,597][__main__][INFO] - Starting iteration 253. 
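Editor's note: "Sharing advantage alignment data." followed by "Receiving advantage packets." marks the point where the two agent trainers swap advantage estimates, which advantage-alignment-style opponent shaping needs because each agent's surrogate loss depends on the opponent's advantages as well as its own. A minimal point-to-point sketch with torch.distributed follows; the function name and the single-tensor packet format are assumptions, and it presumes torch.distributed is initialized with one rank per agent trainer.

    import torch
    import torch.distributed as dist

    def exchange_advantage_packets(own_advantages: torch.Tensor,
                                   peer_rank: int) -> torch.Tensor:
        # Send our advantage estimates to the other agent's trainer while
        # receiving theirs; both sides call this symmetrically. The send is
        # non-blocking so the symmetric recv cannot deadlock.
        peer_advantages = torch.empty_like(own_advantages)
        pending_send = dist.isend(own_advantages, dst=peer_rank)  # "Sharing ..."
        dist.recv(peer_advantages, src=peer_rank)                 # "Receiving ..."
        pending_send.wait()
        return peer_advantages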
[2025-08-20 12:09:34,958][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:09:34,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:09:34,966][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:09:37,419][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:09:37,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:09:37,427][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:09:37,430][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:09:37,430][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:09:37,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:38,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:39,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:40,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:40,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:41,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:42,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:43,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:44,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:44,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:45,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:46,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:47,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:48,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:48,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:49,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:09:50,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:51,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:52,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:52,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:53,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:54,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:55,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:56,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:57,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:58,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:59,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:09:59,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:00,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:01,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:02,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:03,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:04,619][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:10:05,586][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:10:05,587][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:10:07,060][__main__][INFO] - Iteration 254 took 55s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 24m 46s. Estimated total time: 15h 24m 22s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 26s, 500 more iterations: 7h 42m 11s. [2025-08-20 12:10:07,062][__main__][INFO] - Starting iteration 254. 
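Editor's note: each iteration also logs "Get advantages with critic gradient accumulation", which reads as the critic's value estimates being used to form advantages while the critic's regression loss is back-propagated in the same pass. A minimal one-step TD sketch under that reading follows; the estimator choice (plain TD(0) rather than, say, GAE) and all names are assumptions about what the mllm trainer does.

    import torch
    import torch.nn.functional as F

    def advantages_with_critic_accumulation(critic, obs, rewards, gamma=0.99):
        # V(s_t) for every step in the trajectory; shape (T,).
        values = critic(obs).squeeze(-1)
        # Bootstrap targets; detached so only the regression term trains the critic.
        next_values = torch.cat([values[1:], values.new_zeros(1)]).detach()
        td_target = rewards + gamma * next_values
        # A_t = r_t + γ V(s_{t+1}) - V(s_t), detached for use in the policy loss.
        advantages = (td_target - values).detach()
        # Accumulate critic gradients now; the optimizer step happens later.
        F.mse_loss(values, td_target).backward()
        return advantages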
[2025-08-20 12:10:30,657][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:10:30,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:10:30,707][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:10:33,163][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:10:33,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:10:33,170][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:10:33,173][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:10:33,173][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:10:33,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:34,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:35,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:35,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:36,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:37,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:38,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:39,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:39,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:40,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:41,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:42,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:43,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:43,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:44,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:45,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:10:46,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:46,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:47,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:48,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:49,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:50,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:51,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:52,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:53,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:53,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:54,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:55,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:56,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:57,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:57,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:10:58,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:00,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:11:01,233][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:11:01,234][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:11:02,631][__main__][INFO] - Iteration 255 took 55s (38.04% Gen, 61.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 25m 37s. Estimated total time: 15h 26m 8s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 36s, 500 more iterations: 7h 43m 4s. [2025-08-20 12:11:02,633][__main__][INFO] - Starting iteration 255. 
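Editor's note: the 32 "Accumulated the policy gradient loss for 8960 tokens." records per iteration, followed by a single "Apply reinforce step", are a standard gradient-accumulation pattern: every micro-batch of 8,960 tokens contributes a REINFORCE surrogate loss whose gradients are summed before one optimizer step. A self-contained sketch with stand-in tensors follows; the shapes and the toy linear policy head are illustrative only, not the actual model.

    import torch

    NUM_MICRO_BATCHES = 32   # 32 accumulation records per iteration above
    TOKENS_PER_BATCH = 8960  # constant token count in every record

    policy = torch.nn.Linear(16, 4)  # stand-in policy head
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

    for _ in range(NUM_MICRO_BATCHES):
        obs = torch.randn(TOKENS_PER_BATCH, 16)
        actions = torch.randint(0, 4, (TOKENS_PER_BATCH,))
        advantages = torch.randn(TOKENS_PER_BATCH)  # from the critic
        logp = torch.log_softmax(policy(obs), dim=-1)
        logp_taken = logp[torch.arange(TOKENS_PER_BATCH), actions]
        # REINFORCE surrogate; divide by the batch count so accumulation averages.
        loss = -(advantages * logp_taken).mean() / NUM_MICRO_BATCHES
        loss.backward()   # "Accumulated the policy gradient loss ..."
    optimizer.step()      # "Apply reinforce step"
    optimizer.zero_grad()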
[2025-08-20 12:11:26,019][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:11:26,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:11:26,027][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:11:28,500][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:11:28,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:11:28,508][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:11:28,510][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:11:28,511][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:11:28,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:29,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:30,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:31,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:31,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:32,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:33,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:34,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:35,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:35,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:36,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:37,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:38,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:39,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:39,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:40,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:11:41,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:42,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:43,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:43,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:44,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:45,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:46,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:47,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:48,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:49,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:49,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:50,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:51,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:52,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:53,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:53,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:11:55,416][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:11:56,384][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:11:56,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:11:58,327][__main__][INFO] - Iteration 256 took 55s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 26m 41s. Estimated total time: 15h 28m 8s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 48s, 500 more iterations: 7h 44m 4s. [2025-08-20 12:11:58,329][__main__][INFO] - Starting iteration 256. 
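Editor's note: a small consistency check on the "Apply reinforce step" records. The reported ΔVRAM Allocated of 2.288818359375e-05 GB is exactly 24 KiB, i.e. once the optimizer state exists, each update allocates essentially no new memory:

    >>> 2.288818359375e-05 * 2**30  # binary GB to bytes
    24576.0
    >>> 24576 / 1024                # bytes to KiB
    24.0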
[2025-08-20 12:12:21,706][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:12:21,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:12:21,713][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:12:24,174][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:12:24,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:12:24,181][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:12:24,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:12:24,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:12:24,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:25,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:26,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:26,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:27,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:28,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:29,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:30,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:30,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:31,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:32,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:33,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:34,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:34,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:35,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:36,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:12:37,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:38,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:39,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:40,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:40,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:41,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:42,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:43,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:44,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:44,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:45,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:46,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:47,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:47,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:48,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:49,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:12:51,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:12:52,179][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:12:52,181][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:12:53,554][__main__][INFO] - Iteration 257 took 55s (37.88% Gen, 62.12% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 18m 2s. Estimated total time: 15h 20m 24s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 12s. [2025-08-20 12:12:53,558][__main__][INFO] - Starting iteration 257. 
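Editor's note: after every update, both optimizer states are written back to scratch so a preempted run can resume without losing the optimizer's internal statistics (e.g. Adam moments). A minimal sketch of that checkpointing step follows; the helper name is an assumption, while the two file names come from the records above.

    import torch

    def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir):
        # Persist optimizer internals, not model weights, which are saved separately.
        torch.save(policy_optimizer.state_dict(),
                   f"{out_dir}/policy_optimizer_state.pt")
        torch.save(critic_optimizer.state_dict(),
                   f"{out_dir}/critic_optimizer_state.pt")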
[2025-08-20 12:13:18,566][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:13:18,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:13:18,575][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:13:21,044][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:13:21,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:13:21,052][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:13:21,054][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:13:21,054][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:13:21,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:22,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:22,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:23,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:24,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:25,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:26,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:26,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:27,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:28,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:29,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:30,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:30,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:31,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:32,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:33,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:13:34,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:34,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:36,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:36,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:37,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:38,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:39,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:40,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:40,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:41,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:42,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:43,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:44,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:44,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:45,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:46,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:13:48,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:13:49,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:13:49,038][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:13:50,402][__main__][INFO] - Iteration 258 took 56s (39.62% Gen, 60.38% Train). Generation: 22s, Training: 34s. Estimated remaining time: 11h 44m 4s. Estimated total time: 15h 47m 24s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 44s, 500 more iterations: 7h 53m 42s. [2025-08-20 12:13:50,405][__main__][INFO] - Starting iteration 258. 
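Editor's note: the per-iteration summaries extrapolate linearly from the running mean iteration time, which the numbers above confirm: at roughly 55 s per iteration, 10 more iterations come out to 9m 10s and 500 more to about 7h 38m. A tiny sketch of that projection (function name assumed):

    def projected_time(mean_iter_seconds: float, iterations_left: int) -> str:
        # Linear extrapolation, as in "Time estimates for N more iterations: ...".
        s = round(mean_iter_seconds * iterations_left)
        return f"{s // 3600}h {s % 3600 // 60}m {s % 60}s"

    print(projected_time(55, 10))   # 0h 9m 10s
    print(projected_time(55, 500))  # 7h 38m 20s; the log's 7h 38m 23s implies a 55.006 s mean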
[2025-08-20 12:14:13,725][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:14:13,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:14:13,732][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:14:16,169][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:14:16,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:14:16,177][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:14:16,179][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:14:16,180][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:14:16,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:17,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:18,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:18,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:19,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:20,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:21,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:22,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:22,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:23,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:24,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:25,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:25,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:26,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:27,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:28,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:14:29,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:29,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:30,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:31,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:32,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:33,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:34,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:35,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:36,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:36,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:37,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:38,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:39,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:39,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:40,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:41,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:14:43,125][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:14:44,220][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:14:44,222][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:14:45,719][__main__][INFO] - Iteration 259 took 55s (37.71% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 17m 39s. Estimated total time: 15h 21m 53s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 11s, 500 more iterations: 7h 40m 56s. [2025-08-20 12:14:45,721][__main__][INFO] - Starting iteration 259. 
[2025-08-20 12:15:09,089][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:15:09,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:15:09,097][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:15:11,563][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:15:11,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:15:11,571][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:15:11,574][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:15:11,574][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:15:11,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:12,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:13,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:14,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:15,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:15,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:16,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:17,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:18,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:19,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:19,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:20,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:21,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:22,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:22,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:23,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:15:24,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:25,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:26,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:27,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:28,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:29,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:29,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:30,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:31,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:32,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:32,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:33,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:34,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:35,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:36,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:36,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:15:38,518][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:15:39,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:15:39,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:15:41,264][__main__][INFO] - Iteration 260 took 55s (37.64% Gen, 62.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 20m 32s. Estimated total time: 15h 25m 42s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 34s, 500 more iterations: 7h 42m 51s. [2025-08-20 12:15:41,266][__main__][INFO] - Starting iteration 260. 
[2025-08-20 12:16:04,875][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:16:04,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:16:04,883][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:16:07,313][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:16:07,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:16:07,322][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:16:07,324][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:16:07,324][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:16:07,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:08,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:09,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:09,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:10,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:11,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:12,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:13,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:13,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:14,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:15,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:16,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:17,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:17,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:18,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:19,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:16:20,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:21,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:22,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:23,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:23,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:24,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:25,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:26,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:27,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:27,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:28,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:29,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:30,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:31,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:31,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:32,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:16:34,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:16:35,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:16:35,257][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:16:36,645][__main__][INFO] - Iteration 261 took 55s (38.24% Gen, 61.75% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 16m 53s. Estimated total time: 15h 22m 58s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 29s. [2025-08-20 12:16:36,647][__main__][INFO] - Starting iteration 261. 
[2025-08-20 12:17:00,073][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:00,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:00,081][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:02,550][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:02,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:02,558][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:02,561][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:17:02,561][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:17:02,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:03,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:04,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:05,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:06,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:06,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:07,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:08,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:09,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:09,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:10,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:11,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:12,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:13,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:13,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:14,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:17:15,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:16,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:17,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:17,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:18,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:19,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:20,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:21,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:22,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:23,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:23,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:24,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:25,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:26,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:27,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:27,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:29,478][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:17:30,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:17:30,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:17:31,829][__main__][INFO] - Iteration 262 took 55s (37.97% Gen, 62.02% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 12m 41s. Estimated total time: 15h 19m 41s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 50s. [2025-08-20 12:17:31,830][__main__][INFO] - Starting iteration 262. 
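Each "Accumulated the policy gradient loss for 8960 tokens." line marks one microbatch whose loss is backpropagated immediately, so gradients build up across the 32 chunks of an iteration before a single optimizer step. A hedged sketch of that loop follows; policy.log_probs and the batch keys are assumptions, and only the 8960-token chunk size is taken from the log.

```python
import torch

def accumulate_policy_gradient(policy, microbatches, total_tokens, logger):
    for mb in microbatches:  # 32 chunks of 8960 action tokens per step in this run
        logps = policy.log_probs(mb["input_ids"], mb["action_mask"])
        # REINFORCE surrogate -E[A * log pi(a|s)], normalized by the step's
        # total token count so the accumulated gradient matches a full-batch pass.
        loss = -(mb["advantages"] * logps).sum() / total_tokens
        loss.backward()  # gradients accumulate in .grad across microbatches
        logger.info(
            "Accumulated the policy gradient loss for %d tokens.",
            int(mb["action_mask"].sum()),
        )
```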
[2025-08-20 12:17:55,057][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:55,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:55,065][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:57,515][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:57,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:57,523][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:17:57,525][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:17:57,525][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:17:57,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:58,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:17:59,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:00,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:00,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:01,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:02,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:03,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:04,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:04,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:05,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:06,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:07,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:08,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:08,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:09,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:18:10,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:11,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:12,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:13,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:14,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:14,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:15,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:16,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:17,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:18,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:18,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:19,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:20,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:21,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:22,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:22,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:24,489][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:18:25,419][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:18:25,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:18:28,363][__main__][INFO] - Iteration 263 took 56s (36.76% Gen, 63.24% Train). Generation: 20s, Training: 35s. Estimated remaining time: 11h 34m 14s. Estimated total time: 15h 42m 11s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 13s, 500 more iterations: 7h 51m 5s. [2025-08-20 12:18:28,364][__main__][INFO] - Starting iteration 263. 
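"Apply reinforce step" then spends the accumulated gradients in one optimizer update. The tiny ΔVRAM Allocated on these steps (about 2.3e-05 GB) is consistent with the optimizer's state buffers already existing and merely being updated in place. A minimal sketch, assuming a standard PyTorch optimizer; the gradient-clipping threshold is an illustrative assumption, not something this log shows.

```python
import torch

def apply_reinforce_step(policy, optimizer, max_grad_norm=1.0):
    # One update per iteration, after every microbatch has called backward().
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)  # drop grads; the next iteration re-accumulates
```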
[2025-08-20 12:18:52,032][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:18:52,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:18:52,040][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:18:54,517][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:18:54,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:18:54,525][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:18:54,527][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:18:54,528][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:18:54,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:55,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:56,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:57,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:57,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:58,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:18:59,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:00,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:01,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:01,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:02,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:03,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:04,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:05,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:05,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:06,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:19:07,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:08,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:09,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:09,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:11,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:11,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:12,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:13,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:14,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:15,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:15,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:16,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:17,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:18,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:19,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:19,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:21,496][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:19:22,471][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:19:22,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:19:24,361][__main__][INFO] - Iteration 264 took 55s (37.86% Gen, 62.14% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 24m 23s. Estimated total time: 15h 33m 16s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 19s, 500 more iterations: 7h 46m 38s. [2025-08-20 12:19:24,363][__main__][INFO] - Starting iteration 264. 
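The paired "Saved main optimizer state" / "Saved critic optimizer state" lines show both optimizers being checkpointed every iteration, which is what lets a preempted run resume without losing the optimizers' moment estimates. The equivalent with plain torch.save, using the filenames from the log:

```python
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir):
    # Overwritten every iteration, so only the most recent state survives.
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")
```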
[2025-08-20 12:19:48,349][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:19:48,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:19:48,357][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:19:50,785][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:19:50,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:19:50,793][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:19:50,795][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:19:50,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:19:51,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:51,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:52,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:53,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:54,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:55,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:55,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:56,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:57,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:58,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:59,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:19:59,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:00,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:01,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:02,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:03,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:20:03,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:04,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:05,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:06,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:07,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:07,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:08,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:09,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:10,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:11,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:12,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:13,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:13,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:14,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:15,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:16,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:17,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:20:18,852][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:20:18,854][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:20:20,427][__main__][INFO] - Iteration 265 took 56s (38.43% Gen, 61.57% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 24m 33s. Estimated total time: 15h 34m 23s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 26s, 500 more iterations: 7h 47m 11s. [2025-08-20 12:20:20,429][__main__][INFO] - Starting iteration 265. 
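The time estimates are internally consistent with a flat projection from a running average of roughly 56.06 s per iteration over a 1000-iteration run: 10 × 56.06 s ≈ 9m 20s, 100 × 56.06 s ≈ 1h 33m 26s, and 1000 × 56.06 s ≈ 15h 34m 23s, exactly the figures printed for iteration 265. The sketch below reproduces that arithmetic; the 1000-iteration budget is inferred from the numbers, not stated anywhere in the log.

```python
def eta_report(avg_iter_seconds, done, total=1000):
    # Flat projection: time = average iteration time x iteration count.
    def fmt(seconds):
        s = int(seconds)
        h, rem = divmod(s, 3600)
        m, sec = divmod(rem, 60)
        return f"{h}h {m}m {sec}s" if h else f"{m}m {sec}s"
    return (
        f"Estimated remaining time: {fmt(avg_iter_seconds * (total - done))}. "
        f"Estimated total time: {fmt(avg_iter_seconds * total)}. "
        "Time estimates for "
        + ", ".join(f"{n} more iterations: {fmt(avg_iter_seconds * n)}"
                    for n in (10, 100, 500)) + "."
    )

print(eta_report(56.06, done=265))  # close to the iteration-265 summary above
```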
[2025-08-20 12:20:44,774][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:20:44,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:20:44,781][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:20:47,195][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:20:47,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:20:47,203][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:20:47,205][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:20:47,205][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:20:47,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:48,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:49,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:49,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:50,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:51,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:52,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:53,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:53,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:54,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:55,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:56,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:57,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:57,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:58,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:20:59,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:21:00,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:01,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:01,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:02,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:03,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:04,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:05,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:06,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:07,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:07,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:08,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:09,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:10,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:11,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:11,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:12,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:14,193][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:21:15,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:21:15,230][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:21:16,697][__main__][INFO] - Iteration 266 took 56s (38.94% Gen, 61.06% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 27m 2s. Estimated total time: 15h 37m 48s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 54s. [2025-08-20 12:21:16,699][__main__][INFO] - Starting iteration 266. 
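"Sharing advantage alignment data." and "Receiving advantage packets." mark a hand-off in which each trainer publishes its own advantage estimates and consumes its counterpart's; advantage alignment needs both sides' advantages to shape the policy gradient. The transport below, a pair of in-process queues, is purely an illustrative assumption; the log does not reveal the real mechanism.

```python
from queue import Queue

import torch

def exchange_advantages(my_adv: torch.Tensor, outbox: Queue, inbox: Queue):
    outbox.put(my_adv.detach().cpu())  # "Sharing advantage alignment data."
    return inbox.get()                 # "Receiving advantage packets." (blocks until the peer shares)
```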
[2025-08-20 12:21:40,111][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:21:40,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:21:40,118][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:21:42,574][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:21:42,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:21:42,582][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:21:42,584][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:21:42,584][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:21:42,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:43,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:44,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:45,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:46,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:46,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:47,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:48,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:49,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:50,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:50,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:51,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:52,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:53,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:53,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:54,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:21:55,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:56,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:57,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:57,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:21:59,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:00,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:00,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:01,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:02,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:03,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:03,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:04,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:05,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:06,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:07,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:07,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:09,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:22:10,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:22:10,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:22:11,988][__main__][INFO] - Iteration 267 took 55s (37.92% Gen, 62.07% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 9m 47s. Estimated total time: 15h 21m 28s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 8s, 500 more iterations: 7h 40m 44s. [2025-08-20 12:22:11,990][__main__][INFO] - Starting iteration 267. 
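"Get advantages with critic gradient accumulation" reads as one traversal doing double duty: advantage estimates come out, and the critic's regression loss is backpropagated along the way so its gradients accumulate too. A minimal one-step TD sketch under that reading; the shapes, the discount, and the squared-error loss are all assumptions.

```python
import torch
import torch.nn.functional as F

def advantages_with_critic_accum(critic, states, rewards, gamma=0.99, train_critic=True):
    values = critic(states)                          # V(s_0) .. V(s_T), shape [T + 1]
    targets = rewards + gamma * values[1:].detach()  # r_t + gamma * V(s_{t+1})
    adv = targets - values[:-1]                      # one-step TD advantage
    if train_critic:
        F.mse_loss(values[:-1], targets).backward()  # accumulate critic gradients
    return adv.detach()
```

The same helper with train_critic=False would serve "Compute alternative advantage estimates", where the counterfactual batch is scored without training the critic on it; that split is itself an assumption of this sketch.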
[2025-08-20 12:22:35,290][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:22:35,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:22:35,438][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:22:37,881][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:22:37,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:22:37,888][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:22:37,891][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:22:37,891][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:22:38,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:38,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:39,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:40,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:41,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:42,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:42,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:43,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:44,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:45,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:46,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:46,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:47,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:48,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:49,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:50,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:22:50,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:51,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:52,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:53,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:54,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:54,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:56,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:56,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:57,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:58,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:22:59,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:00,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:00,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:01,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:02,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:03,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:04,782][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:23:05,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:23:05,889][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:23:07,336][__main__][INFO] - Iteration 268 took 55s (37.69% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 9m 48s. Estimated total time: 15h 22m 25s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 12s. [2025-08-20 12:23:07,337][__main__][INFO] - Starting iteration 268. 
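A small reading aid for the per-iteration summaries: the seconds and the percentages are evidently rounded independently from the unrounded phase times, which is why "Generation: 20s, Training: 34s" can sit beside "took 55s" (for example 20.8 s + 34.4 s ≈ 55.2 s), and why the two percentages occasionally sum to 99.99 rather than 100.00. A sketch of that formatting:

```python
def split_report(gen_s, train_s):
    # Every figure is rounded for display on its own, so the rounded
    # parts need not add up to the rounded total.
    total = gen_s + train_s
    return (
        f"took {total:.0f}s ({100 * gen_s / total:.2f}% Gen, "
        f"{100 * train_s / total:.2f}% Train). "
        f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s."
    )

print(split_report(20.8, 34.4))  # took 55s (37.68% Gen, 62.32% Train). ...
```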
[2025-08-20 12:23:31,160][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:23:31,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:23:31,168][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:23:33,629][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:23:33,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:23:33,637][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:23:33,639][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:23:33,640][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:23:33,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:34,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:35,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:36,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:37,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:37,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:38,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:39,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:40,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:41,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:41,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:42,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:43,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:44,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:45,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:45,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:23:46,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:47,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:48,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:49,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:50,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:51,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:51,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:52,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:53,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:54,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:55,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:55,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:56,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:57,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:58,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:23:59,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:00,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:24:01,672][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:24:01,674][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:24:02,991][__main__][INFO] - Iteration 269 took 55s (38.38% Gen, 61.61% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 14m 0s. Estimated total time: 15h 27m 32s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 45s, 500 more iterations: 7h 43m 46s. [2025-08-20 12:24:02,992][__main__][INFO] - Starting iteration 269. 
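"Create alternative trajectory batch" is the counterfactual half of the pipeline. The run name (ipd_prev_ad_align) points at the iterated prisoner's dilemma, where every decision has exactly one alternative, so one plausible reconstruction is that the sampled batch is cloned with each action swapped for the move not taken before the critic scores it. This is hypothetical; the action ids and batch layout below are invented for illustration.

```python
import torch

COOPERATE, DEFECT = 0, 1  # placeholder ids for the two IPD moves

def alternative_batch(batch):
    # Clone the sampled trajectories and flip each action to its
    # counterfactual, so advantages can be estimated for the road not taken.
    alt = {key: value.clone() for key, value in batch.items()}
    acts = alt["actions"]
    alt["actions"] = torch.where(acts == COOPERATE, DEFECT, COOPERATE)
    return alt
```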
[2025-08-20 12:24:26,449][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:24:26,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:24:26,457][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:24:28,902][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:24:28,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:24:28,910][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:24:28,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:24:28,913][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:24:29,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:30,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:30,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:31,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:32,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:33,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:33,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:34,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:35,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:36,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:37,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:37,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:38,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:39,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:40,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:41,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:24:41,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:42,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:43,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:44,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:45,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:46,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:47,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:47,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:48,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:49,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:50,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:51,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:51,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:52,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:53,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:54,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:24:55,843][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:24:56,875][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:24:56,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:24:58,196][__main__][INFO] - Iteration 270 took 55s (38.04% Gen, 61.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 5m 35s. Estimated total time: 15h 20m 2s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 1s. [2025-08-20 12:24:58,197][__main__][INFO] - Starting iteration 270. 
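The counterpart to the per-iteration saves, not shown in this stretch of the log, would be the restore path: on restart, loading the two state dicts puts the optimizers' moments back so training resumes where it left off. Again a sketch with plain torch.load:

```python
import torch

def load_optimizer_states(policy_opt, critic_opt, out_dir):
    # Mirrors save_optimizer_states above; call after rebuilding the optimizers.
    policy_opt.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
    critic_opt.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))
```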
[2025-08-20 12:25:21,447][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:25:21,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:25:21,455][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:25:23,893][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:25:23,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:25:23,901][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:25:23,903][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:25:23,903][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:25:24,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:24,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:25,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:26,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:27,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:28,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:28,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:29,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:30,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:31,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:32,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:32,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:33,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:34,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:35,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:36,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:25:36,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:37,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:38,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:39,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:40,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:40,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:41,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:42,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:43,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:44,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:45,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:46,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:46,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:47,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:48,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:49,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:25:50,843][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:25:51,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:25:51,786][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:25:53,201][__main__][INFO] - Iteration 271 took 55s (37.84% Gen, 62.16% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 1m 20s. Estimated total time: 15h 16m 42s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 21s. [2025-08-20 12:25:53,202][__main__][INFO] - Starting iteration 271. 
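Each optimizer update above is preceded by 32 "Accumulated the policy gradient loss for 8960 tokens." messages, and the count is always exactly 8960, which points to fixed-size, padded micro-batches. A minimal sketch of that accumulation pattern, assuming an HF-style causal LM and per-token advantages carried in each micro-batch (the field names are illustrative, not mllm's actual schema):

    import torch

    def accumulate_policy_gradient(model, micro_batches, logger):
        # REINFORCE-style loss, accumulated over micro-batches so that a
        # single optimizer step sees gradients from the whole batch.
        n_mb = len(micro_batches)
        for mb in micro_batches:
            logits = model(mb["input_ids"], attention_mask=mb["attention_mask"]).logits
            # Log-probability of each realized next token.
            logp = torch.log_softmax(logits[:, :-1], dim=-1)
            token_logp = logp.gather(-1, mb["input_ids"][:, 1:, None]).squeeze(-1)
            mask = mb["action_mask"][:, 1:].float()      # 1 on generated tokens
            n_tokens = int(mask.sum().item())
            loss = -(token_logp * mb["advantages"][:, 1:] * mask).sum() / max(n_tokens, 1)
            (loss / n_mb).backward()                     # accumulate; no step yet
            logger.info("Accumulated the policy gradient loss for %d tokens.", n_tokens)

At 32 micro-batches of 8960 tokens, one update covers 286,720 tokens.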
[2025-08-20 12:26:16,595][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:26:16,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:26:16,603][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:26:19,075][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:26:19,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:26:19,083][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:26:19,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:26:19,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:26:19,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:20,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:20,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:21,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:22,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:23,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:24,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:24,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:25,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:26,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:27,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:28,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:28,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:29,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:30,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:31,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:26:32,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:32,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:33,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:34,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:35,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:36,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:37,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:38,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:39,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:39,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:40,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:41,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:42,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:42,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:43,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:44,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:26:46,147][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:26:47,115][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:26:47,117][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:26:48,430][__main__][INFO] - Iteration 272 took 55s (37.89% Gen, 62.11% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 4m 10s. Estimated total time: 15h 20m 27s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 13s. [2025-08-20 12:26:48,432][__main__][INFO] - Starting iteration 272. 
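"Sharing advantage alignment data." followed by "Receiving advantage packets." is the hand-off where the trainers exchange advantage estimates so each policy gradient can be shaped by the other side's outcomes. The surrogate itself is not visible in the log; the sketch below shows one plausible advantage-alignment weighting, in which a step's own advantage is modulated by a discounted sum of the partner's earlier advantages (gamma, beta, and the combination rule are all assumptions, not mllm's formula):

    import torch

    def advantage_alignment_weights(a_self, a_other, gamma=0.96, beta=1.0):
        # a_self, a_other: [T] per-step advantages over the same trajectory;
        # a_other is what arrives in the exchanged "advantage packet".
        carry = 0.0
        shaped = torch.empty_like(a_self)
        for t in range(a_self.shape[0]):
            # carry holds the discounted sum of partner advantages at steps < t.
            shaped[t] = a_self[t] * (1.0 + beta * carry)
            carry = gamma * carry + float(a_other[t])
        return shaped  # used in place of raw advantages in the PG loss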
[2025-08-20 12:27:11,713][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:27:11,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:27:11,721][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:27:14,177][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:27:14,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:27:14,184][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:27:14,186][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:27:14,187][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:27:14,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:15,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:16,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:16,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:17,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:18,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:19,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:20,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:20,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:21,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:22,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:23,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:24,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:24,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:25,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:26,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:27:27,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:27,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:28,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:29,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:30,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:31,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:32,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:33,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:33,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:34,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:35,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:36,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:37,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:37,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:38,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:39,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:27:41,136][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:27:42,254][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:27:42,256][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:27:43,695][__main__][INFO] - Iteration 273 took 55s (37.68% Gen, 62.32% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 3m 50s. Estimated total time: 15h 21m 2s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 31s. [2025-08-20 12:27:43,698][__main__][INFO] - Starting iteration 273. 
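"Apply reinforce step" is where the accumulated gradients are finally applied. The ΔVRAM Allocated of 2.288818359375e-05 GB is exactly 24 KiB; after the first iteration the optimizer state already exists, so the step only touches small bookkeeping buffers. A minimal sketch of the update itself, assuming a conventional clip-then-step pattern (the clipping norm is illustrative):

    import torch

    def apply_reinforce_step(model, optimizer, max_grad_norm=1.0):
        # Gradients from all 32 micro-batches are already accumulated;
        # apply them once and clear for the next iteration.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)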
[2025-08-20 12:28:07,323][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:28:07,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:28:07,330][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:28:09,792][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:28:09,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:28:09,800][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:28:09,802][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:28:09,803][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:28:10,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:10,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:11,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:12,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:13,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:14,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:14,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:15,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:16,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:17,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:18,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:18,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:19,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:20,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:21,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:21,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:28:22,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:23,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:24,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:25,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:26,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:27,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:28,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:28,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:29,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:30,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:31,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:31,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:32,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:33,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:34,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:35,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:28:36,730][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:28:37,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:28:37,678][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:28:39,120][__main__][INFO] - Iteration 274 took 55s (38.20% Gen, 61.80% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 5m 34s. Estimated total time: 15h 23m 42s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 51s. [2025-08-20 12:28:39,122][__main__][INFO] - Starting iteration 274. 
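Both optimizer states are checkpointed after every update, so a preempted job can resume without losing its optimizer moments. A sketch that reproduces the two save lines, using the standard torch.save on state_dict (whether mllm wraps this further is not visible in the log):

    import logging
    import torch

    logger = logging.getLogger("mllm.training.trainer_common")

    def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir):
        for label, tag, opt in (("main", "policy", policy_optimizer),
                                ("critic", "critic", critic_optimizer)):
            path = f"{out_dir}/{tag}_optimizer_state.pt"
            torch.save(opt.state_dict(), path)
            logger.info("Saved %s optimizer state to %s", label, path)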
[2025-08-20 12:29:02,363][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:29:02,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:29:02,370][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:29:04,832][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:29:04,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:29:04,840][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:29:04,842][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:29:04,842][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:29:05,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:05,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:06,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:07,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:08,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:09,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:09,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:10,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:11,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:12,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:13,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:13,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:14,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:15,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:16,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:17,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:29:17,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:18,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:19,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:20,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:21,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:21,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:22,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:23,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:24,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:25,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:26,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:27,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:28,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:29,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:29,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:30,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:29:32,193][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:29:33,125][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:29:33,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:29:34,401][__main__][INFO] - Iteration 275 took 55s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 2m 15s. Estimated total time: 15h 21m 18s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 7s, 500 more iterations: 7h 40m 39s. [2025-08-20 12:29:34,402][__main__][INFO] - Starting iteration 275. 
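The per-iteration summary splits wall time into generation and training and extrapolates the rest of the run; the small drift in "Estimated total time" from one iteration to the next suggests the average is taken over recent iterations rather than the whole history. A sketch of the arithmetic, assuming a simple sliding-window mean (the window size is a guess):

    from collections import deque

    def fmt(sec):
        h, rem = divmod(int(sec), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s"

    class IterationTimer:
        def __init__(self, total_iters, window=50):
            self.total = total_iters
            self.recent = deque(maxlen=window)  # assumed smoothing window

        def report(self, it, gen_s, train_s):
            took = gen_s + train_s
            self.recent.append(took)
            avg = sum(self.recent) / len(self.recent)
            left = (self.total - it) * avg
            print(f"Iteration {it} took {took:.0f}s "
                  f"({100 * gen_s / took:.2f}% Gen, {100 * train_s / took:.2f}% Train). "
                  f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s. "
                  f"Estimated remaining time: {fmt(left)}. "
                  f"Estimated total time: {fmt(it * avg + left)}. "
                  f"Time estimates for 10 more iterations: {fmt(10 * avg)}, "
                  f"100 more iterations: {fmt(100 * avg)}, "
                  f"500 more iterations: {fmt(500 * avg)}.")

For iteration 276 the printed figures imply an average of about 55.2 s per iteration (9m 12s over 10 iterations), and the ~15h 20m estimated total at that rate works out to roughly a 1,000-iteration run.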
[2025-08-20 12:29:57,766][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:29:57,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:29:57,774][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:00,264][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:00,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:00,271][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:00,274][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:30:00,274][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:30:00,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:01,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:02,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:02,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:03,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:04,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:05,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:06,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:06,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:07,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:08,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:09,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:10,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:10,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:11,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:12,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:30:13,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:14,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:14,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:15,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:16,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:17,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:18,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:19,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:20,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:20,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:21,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:22,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:23,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:24,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:24,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:25,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:27,161][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:30:28,158][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:30:28,159][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:30:29,612][__main__][INFO] - Iteration 276 took 55s (37.78% Gen, 62.22% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 0m 10s. Estimated total time: 15h 20m 9s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 4s. [2025-08-20 12:30:29,614][__main__][INFO] - Starting iteration 276. 
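"Get advantages with critic gradient accumulation" pairs advantage estimation with a critic regression whose gradients are accumulated in the same pass. The estimator is not shown in the log; below is a sketch using one-step TD residuals with a stop-gradient target (the trainer may well use a different estimator such as GAE):

    import torch
    import torch.nn.functional as F

    def td_advantages(values, rewards, gamma=0.96):
        # values: [T+1] critic estimates (last entry is the bootstrap value)
        # and must require grad; rewards: [T] per-step rewards.
        targets = rewards + gamma * values[1:].detach()
        advantages = targets - values[:-1].detach()
        critic_loss = F.mse_loss(values[:-1], targets)
        critic_loss.backward()   # accumulate critic gradients now
        return advantages        # already detached: safe to ship in a packet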
[2025-08-20 12:30:52,980][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:52,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:53,004][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:55,462][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:55,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:55,470][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:30:55,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:30:55,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:30:55,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:56,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:57,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:58,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:59,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:30:59,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:00,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:01,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:02,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:02,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:03,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:04,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:05,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:06,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:06,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:07,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:31:08,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:09,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:10,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:10,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:11,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:12,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:13,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:14,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:14,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:15,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:16,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:17,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:18,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:19,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:20,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:20,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:22,508][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:31:23,475][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:31:23,476][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:31:24,934][__main__][INFO] - Iteration 277 took 55s (37.79% Gen, 62.21% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 1m 6s. Estimated total time: 15h 21m 59s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 11s, 500 more iterations: 7h 40m 59s. [2025-08-20 12:31:24,936][__main__][INFO] - Starting iteration 277. 
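Note that the trio of ad_align tasks runs twice in every iteration before the exchange. The log alone does not say whether the two passes correspond to the two agents or to two alternative batches for one agent; the sketch below shows the former reading, with hypothetical method names (none of them appear in the log, they only mirror its ordering):

    def ad_align_iteration(trainers, logger):
        # Two passes of the same three tasks, matching the repeated
        # task lines above.
        for tr in trainers:
            tr.create_alternative_trajectory_batch()
            tr.get_advantages_with_critic_grad_accumulation()
            tr.compute_alternative_advantage_estimates()
        logger.info("Sharing advantage alignment data.")
        packets = [tr.export_advantage_packet() for tr in trainers]
        logger.info("Receiving advantage packets.")
        for tr, packet in zip(trainers, reversed(packets)):
            tr.apply_partner_advantages(packet)  # each side gets the other's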
[2025-08-20 12:31:49,002][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:31:49,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:31:49,010][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:31:51,468][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:31:51,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:31:51,476][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:31:51,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:31:51,479][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:31:51,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:52,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:53,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:54,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:54,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:55,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:56,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:57,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:58,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:58,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:31:59,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:00,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:01,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:02,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:02,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:03,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:32:04,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:05,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:06,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:07,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:08,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:08,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:09,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:10,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:11,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:12,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:12,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:13,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:14,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:15,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:16,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:16,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:18,506][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:32:19,545][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:32:19,547][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:32:20,854][__main__][INFO] - Iteration 278 took 55s (38.60% Gen, 61.40% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 10m 1s. Estimated total time: 15h 31m 51s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 11s, 500 more iterations: 7h 45m 55s. [2025-08-20 12:32:20,857][__main__][INFO] - Starting iteration 278. 
[2025-08-20 12:32:44,280][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:32:44,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:32:44,287][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:32:46,777][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:32:46,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:32:46,784][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:32:46,786][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:32:46,787][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:32:47,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:47,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:48,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:49,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:50,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:51,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:51,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:52,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:53,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:54,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:55,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:55,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:56,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:57,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:58,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:32:58,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:32:59,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:00,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:01,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:02,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:03,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:04,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:04,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:05,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:06,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:07,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:08,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:08,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:09,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:10,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:11,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:12,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:13,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:33:14,659][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:33:14,661][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:33:16,041][__main__][INFO] - Iteration 279 took 55s (37.93% Gen, 62.06% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 56m 58s. Estimated total time: 15h 19m 43s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 51s. [2025-08-20 12:33:16,042][__main__][INFO] - Starting iteration 279. 
[2025-08-20 12:33:39,366][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:33:39,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:33:39,374][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:33:41,825][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:33:41,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:33:41,834][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:33:41,836][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:33:41,836][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:33:42,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:42,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:43,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:44,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:45,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:46,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:46,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:47,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:48,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:49,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:50,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:50,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:51,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:52,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:53,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:54,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:33:54,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:55,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:56,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:57,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:58,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:33:58,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:00,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:00,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:01,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:02,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:03,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:04,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:04,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:05,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:06,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:07,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:08,798][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:34:09,772][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:34:09,774][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:34:11,224][__main__][INFO] - Iteration 280 took 55s (37.83% Gen, 62.17% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 56m 1s. Estimated total time: 15h 19m 41s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 50s. [2025-08-20 12:34:11,226][__main__][INFO] - Starting iteration 280. 
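Every reinforce step in this log is preceded by exactly 32 `Accumulated the policy gradient loss for 8960 tokens` entries, i.e. roughly 286,720 action tokens feed one optimizer update, with gradients accumulated micro-batch by micro-batch. A hedged sketch of such token-normalized REINFORCE accumulation follows; all names (`accumulate_policy_gradient`, the batch keys, the HF-style `.logits` attribute) are illustrative assumptions, and label shifting is omitted for brevity.

```python
import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def accumulate_policy_gradient(policy, micro_batch) -> int:
    """Accumulate REINFORCE gradients for one micro-batch (hypothetical sketch).

    micro_batch: dict with 'input_ids' (B, T), 'action_mask' (B, T),
    and per-token 'advantages' (B, T). Label shifting is omitted.
    """
    out = policy(micro_batch["input_ids"])  # assumes HF-style output with .logits
    logprobs = torch.log_softmax(out.logits, dim=-1)
    taken = logprobs.gather(-1, micro_batch["input_ids"].unsqueeze(-1)).squeeze(-1)
    mask = micro_batch["action_mask"]  # 1 on generated (action) tokens only
    n_tokens = int(mask.sum())
    # REINFORCE: maximize advantage-weighted log-likelihood of taken actions.
    loss = -(micro_batch["advantages"] * taken * mask).sum() / max(n_tokens, 1)
    loss.backward()  # gradients accumulate across successive calls
    logger.info("Accumulated the policy gradient loss for %d tokens.", n_tokens)
    return n_tokens
```

After the 32 calls, a single `optimizer.step(); optimizer.zero_grad()` would realize the `Apply reinforce step` task that follows them in the log.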
[2025-08-20 12:34:34,962][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:34:34,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:34:34,970][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:34:37,420][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:34:37,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:34:37,427][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:34:37,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:34:37,430][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:34:37,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:38,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:39,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:40,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:40,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:41,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:42,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:43,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:44,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:44,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:45,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:46,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:47,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:48,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:48,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:49,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:34:50,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:51,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:52,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:52,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:53,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:54,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:55,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:56,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:57,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:58,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:58,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:34:59,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:00,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:01,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:02,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:02,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:04,387][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:35:05,368][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:35:05,370][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:35:07,398][__main__][INFO] - Iteration 281 took 56s (37.88% Gen, 62.12% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 11m 36s. Estimated total time: 15h 36m 12s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 37s, 500 more iterations: 7h 48m 6s. [2025-08-20 12:35:07,400][__main__][INFO] - Starting iteration 281. 
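The per-iteration summary lines are internally consistent: at iteration 281, `500 more iterations: 7h 48m 6s` is 28,086 s, about 56.2 s per iteration, which also reproduces the `10 more iterations: 9m 21s` figure (561.7 s). Below is a small ETA tracker in that spirit, assuming a plain running mean over completed iterations; the actual smoothing rule is not visible in the log.

```python
class EtaTracker:
    """Running-average ETA estimator (illustrative; not the mllm implementation)."""

    def __init__(self, total_iterations: int):
        self.total = total_iterations
        self.done = 0
        self.elapsed = 0.0

    def update(self, iteration_seconds: float) -> dict:
        self.done += 1
        self.elapsed += iteration_seconds
        mean = self.elapsed / self.done
        return {
            "mean_s": mean,
            "remaining_s": mean * (self.total - self.done),
            "total_s": mean * self.total,
            "next_10_s": mean * 10,
            "next_100_s": mean * 100,
            "next_500_s": mean * 500,
        }

def fmt(seconds: float) -> str:
    """Render seconds in the log's 'XXh YYm ZZs' shape."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"
```

`fmt(tracker.update(55.0)["remaining_s"])` would yield strings in the same format as the `Estimated remaining time` field.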
[2025-08-20 12:35:31,717][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:35:31,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:35:31,724][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:35:34,175][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:35:34,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:35:34,183][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:35:34,185][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:35:34,185][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:35:34,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:35,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:36,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:36,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:37,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:38,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:39,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:40,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:40,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:41,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:42,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:43,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:43,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:44,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:45,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:46,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:35:47,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:47,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:48,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:49,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:50,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:51,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:52,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:53,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:53,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:54,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:55,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:56,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:57,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:57,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:58,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:35:59,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:01,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:36:02,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:36:02,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:36:03,535][__main__][INFO] - Iteration 282 took 56s (38.94% Gen, 61.06% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 10m 1s. Estimated total time: 15h 35m 34s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 33s, 500 more iterations: 7h 47m 47s. [2025-08-20 12:36:03,536][__main__][INFO] - Starting iteration 282. 
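Both optimizer states are rewritten after every iteration, so a preempted job can resume with its moment estimates intact rather than restarting the optimizers cold. The save paths come straight from the log; the surrounding functions are the standard PyTorch pattern, and their names are illustrative.

```python
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    """Persist both optimizer states so a resumed run keeps its moments."""
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

def load_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    """Restore optimizer states saved by save_optimizer_states."""
    policy_opt.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
    critic_opt.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))
```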
[2025-08-20 12:36:27,054][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:36:27,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:36:27,062][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:36:29,524][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:36:29,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:36:29,531][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:36:29,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:36:29,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:36:29,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:30,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:31,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:32,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:33,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:33,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:34,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:35,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:36,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:36,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:37,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:38,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:39,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:40,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:40,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:41,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:36:42,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:43,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:44,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:45,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:46,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:46,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:47,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:48,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:49,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:50,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:50,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:51,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:52,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:53,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:54,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:54,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:36:56,521][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:36:57,679][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:36:57,681][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:36:59,136][__main__][INFO] - Iteration 283 took 55s (37.90% Gen, 62.10% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 0m 10s. Estimated total time: 15h 26m 39s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 19s. [2025-08-20 12:36:59,137][__main__][INFO] - Starting iteration 283. 
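The back-to-back `Sharing advantage alignment data` / `Receiving advantage packets` entries suggest the two co-trained agents swap per-step advantage estimates before each forms its advantage-alignment loss. A minimal sketch of such an exchange over multiprocessing queues follows; the packet layout and the queue transport are pure assumptions, since the log only shows that a send is immediately followed by a blocking receive.

```python
from dataclasses import dataclass
from multiprocessing import Queue

import torch

@dataclass
class AdvantagePacket:
    """Hypothetical payload exchanged between the two trainers."""
    agent_name: str
    iteration: int
    advantages: torch.Tensor  # (num_trajectories, num_steps)

def exchange_advantages(mine: AdvantagePacket, out_q: Queue, in_q: Queue) -> AdvantagePacket:
    """Share our advantages, then block until the opponent's packet arrives."""
    out_q.put(mine)
    theirs = in_q.get()
    assert theirs.iteration == mine.iteration, "trainers out of sync"
    return theirs
```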
[2025-08-20 12:37:22,441][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:37:22,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:37:22,449][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:37:24,927][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:37:24,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:37:24,935][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:37:24,937][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:37:24,938][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:37:25,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:26,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:26,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:27,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:28,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:29,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:29,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:30,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:31,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:32,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:33,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:33,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:34,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:35,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:36,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:37,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:37:37,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:38,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:39,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:40,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:41,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:42,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:43,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:44,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:44,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:45,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:46,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:47,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:48,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:48,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:49,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:50,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:37:52,001][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:37:52,966][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:37:52,968][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:37:54,282][__main__][INFO] - Iteration 284 took 55s (37.79% Gen, 62.20% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 51m 41s. Estimated total time: 15h 19m 4s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 32s. [2025-08-20 12:37:54,284][__main__][INFO] - Starting iteration 284. 
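`Get advantages with critic gradient accumulation` and `Compute alternative advantage estimates` each run twice per iteration, once for the sampled trajectories and once for the alternative batch. Here is a hedged sketch of one plausible reading, one-step TD advantages with the critic's regression gradient accumulated in the same pass; whether mllm actually uses TD(0), GAE, or something else cannot be determined from the log.

```python
import torch

def advantages_with_critic_accumulation(critic, states, rewards, gamma: float = 0.99):
    """One-step TD advantages A_t = r_t + γ V(s_{t+1}) - V(s_t) (assumed form).

    Also accumulates the critic's value-regression gradient, mirroring the
    'Get advantages with critic gradient accumulation' task.
    states: (T+1, ...) state encodings; rewards: (T,) per-step rewards.
    """
    values = critic(states).squeeze(-1)              # (T+1,)
    targets = rewards + gamma * values[1:].detach()  # detached bootstrap targets
    advantages = targets - values[:-1].detach()      # detached: policy-loss input
    critic_loss = torch.nn.functional.mse_loss(values[:-1], targets)
    critic_loss.backward()                           # accumulate; step happens later
    return advantages
```

Detaching the bootstrap target keeps the critic gradient a pure regression signal, and the returned advantages are fully detached so the policy loss cannot backpropagate into the critic.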
[2025-08-20 12:38:17,502][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:38:17,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:38:17,510][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:38:19,960][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:38:19,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:38:19,968][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:38:19,970][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:38:19,971][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:38:20,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:21,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:21,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:22,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:23,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:24,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:25,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:25,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:26,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:27,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:28,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:29,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:29,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:30,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:31,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:32,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:38:32,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:33,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:34,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:35,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:36,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:37,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:38,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:38,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:39,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:40,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:41,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:42,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:42,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:43,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:44,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:45,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:38:46,907][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:38:47,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:38:47,850][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:38:49,219][__main__][INFO] - Iteration 285 took 54s (37.82% Gen, 62.18% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 47m 16s. Estimated total time: 15h 15m 35s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 47s. [2025-08-20 12:38:49,221][__main__][INFO] - Starting iteration 285. 
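`Create alternative trajectory batch` completes with zero ΔVRAM and a ΔTime below the log's one-second resolution, so it is most likely pure tensor bookkeeping: reassembling already-generated rollouts into a batch in which each decision point carries an alternative action whose advantage is then estimated. A rough sketch under that assumption; every shape and name here is inferred from the task name alone.

```python
import torch

def make_alternative_batch(input_ids, action_positions, alt_actions):
    """Clone rollouts and swap in alternative actions at each decision point.

    input_ids: (B, T) rollout tokens; action_positions: (B,) indices of the
    decision token in each rollout; alt_actions: (B,) replacement token ids.
    """
    alt = input_ids.clone()
    alt[torch.arange(alt.size(0)), action_positions] = alt_actions
    return alt
```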
[2025-08-20 12:39:12,378][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:39:12,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:39:12,386][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:39:14,849][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:39:14,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:39:14,857][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:39:14,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:39:14,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:39:15,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:15,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:16,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:17,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:18,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:19,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:19,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:20,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:21,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:22,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:23,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:23,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:24,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:25,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:26,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:27,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:39:27,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:28,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:29,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:30,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:31,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:32,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:33,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:33,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:34,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:35,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:36,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:37,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:37,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:38,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:39,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:40,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:39:41,863][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:39:42,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:39:42,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:39:44,116][__main__][INFO] - Iteration 286 took 54s (37.70% Gen, 62.30% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 45m 42s. Estimated total time: 15h 14m 55s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 27s. [2025-08-20 12:39:44,118][__main__][INFO] - Starting iteration 286. 
[2025-08-20 12:40:07,528][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:40:07,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:40:07,536][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:40:09,981][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:40:09,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:40:09,989][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:40:09,991][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:40:09,991][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:40:10,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:11,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:11,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:12,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:13,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:14,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:15,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:15,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:16,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:17,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:18,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:18,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:19,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:20,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:21,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:22,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:40:22,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:23,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:24,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:25,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:26,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:27,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:28,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:28,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:29,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:30,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:31,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:32,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:32,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:33,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:34,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:35,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:40:36,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:40:37,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:40:37,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:40:39,272][__main__][INFO] - Iteration 287 took 55s (38.02% Gen, 61.97% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 49m 5s. Estimated total time: 15h 19m 13s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 36s. [2025-08-20 12:40:39,274][__main__][INFO] - Starting iteration 287. 
[2025-08-20 12:41:02,652][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:41:02,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:41:02,660][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:41:05,114][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:41:05,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:41:05,122][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:41:05,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:41:05,125][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:41:05,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:06,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:07,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:07,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:08,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:09,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:10,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:10,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:11,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:12,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:13,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:14,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:14,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:15,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:16,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:17,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:41:18,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:18,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:19,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:20,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:21,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:22,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:22,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:23,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:24,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:25,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:26,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:27,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:28,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:28,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:29,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:30,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:41:32,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:41:33,505][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:41:33,507][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:41:34,933][__main__][INFO] - Iteration 288 took 55s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 56m 34s. Estimated total time: 15h 27m 38s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 45s, 500 more iterations: 7h 43m 49s. [2025-08-20 12:41:34,934][__main__][INFO] - Starting iteration 288. 
[2025-08-20 12:41:58,068][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:41:58,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:41:58,075][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:00,552][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:00,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:00,560][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:00,562][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:42:00,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:42:00,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:01,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:02,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:03,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:04,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:04,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:05,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:06,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:07,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:07,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:08,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:09,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:10,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:11,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:11,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:12,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:42:13,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:14,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:15,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:16,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:17,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:17,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:18,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:19,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:20,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:21,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:21,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:22,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:23,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:24,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:25,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:25,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:27,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:42:28,478][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:42:28,480][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:42:29,923][__main__][INFO] - Iteration 289 took 54s (37.61% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 44m 29s. Estimated total time: 15h 16m 28s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 14s. [2025-08-20 12:42:29,924][__main__][INFO] - Starting iteration 289. 
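Annotation: each iteration emits exactly 32 "Accumulated the policy gradient loss for 8960 tokens" entries before one "Apply reinforce step", i.e. the update is built from 32 microbatches of 8,960 action tokens (32 × 8,960 = 286,720 tokens per optimizer step). A minimal sketch of token-count-based gradient accumulation, assuming a per-token REINFORCE loss; policy.logprobs and the batch field names are hypothetical stand-ins.

TOKENS_PER_MICROBATCH = 8960   # matches the repeated log entries above
NUM_MICROBATCHES = 32          # 32 accumulation passes per reinforce step

def accumulate_policy_gradient(policy, microbatches):
    # Hypothetical sketch: sum per-token REINFORCE losses across fixed-size
    # microbatches; gradients pile up in .grad until the optimizer step
    # ("Apply reinforce step") consumes them.
    for batch in microbatches:                 # each holds 8,960 action tokens
        logprobs = policy.logprobs(batch["input_ids"], batch["labels"])
        loss = -(batch["advantages"] * logprobs).sum() / (
            TOKENS_PER_MICROBATCH * NUM_MICROBATCHES
        )
        loss.backward()
        # -> "Accumulated the policy gradient loss for 8960 tokens."

Normalizing by the total token count (rather than per microbatch) keeps the accumulated gradient equal to what a single 286,720-token batch would produce; that normalization is an assumption here.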
[2025-08-20 12:42:54,091][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:54,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:54,099][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:56,533][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:56,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:56,541][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:42:56,543][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:42:56,544][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:42:56,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:57,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:58,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:42:59,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:00,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:00,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:01,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:02,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:03,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:03,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:04,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:05,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:06,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:07,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:07,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:08,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:43:09,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:10,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:11,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:12,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:13,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:13,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:14,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:15,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:16,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:17,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:17,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:18,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:19,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:20,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:21,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:21,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:23,520][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:43:24,505][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:43:24,507][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:43:26,013][__main__][INFO] - Iteration 290 took 56s (38.72% Gen, 61.28% Train). Generation: 21s, Training: 34s. Estimated remaining time: 11h 1m 52s. Estimated total time: 15h 34m 47s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 28s, 500 more iterations: 7h 47m 23s. [2025-08-20 12:43:26,014][__main__][INFO] - Starting iteration 290. 
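Annotation: the paired tasks "Get advantages with critic gradient accumulation" and "Compute alternative advantage estimates" suggest that the critic's value loss is backpropagated in the same pass that produces advantage estimates, once for the sampled trajectories and once for the alternative (counterfactual) trajectory batch. A minimal sketch using one-step TD advantages; the TD form, gamma, and field names are assumptions rather than the mllm estimator.

import torch

def advantages_with_critic_grad(critic, batch, gamma=0.99):
    # Hypothetical sketch: values stay in the autograd graph so the value-loss
    # backward accumulates critic gradients while advantages are computed.
    values = critic(batch["states"])                        # V(s_t)
    with torch.no_grad():
        next_values = critic(batch["next_states"])          # V(s_{t+1})
        targets = batch["rewards"] + gamma * next_values * (1 - batch["dones"])
    value_loss = torch.nn.functional.mse_loss(values, targets)
    value_loss.backward()                    # critic gradient accumulation
    return (targets - values).detach()       # A_t = r_t + γ V(s_{t+1}) - V(s_t)

Calling this once on the sampled batch and once on the alternative batch would match the two task blocks logged per iteration.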
[2025-08-20 12:43:49,072][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:43:49,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:43:49,079][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:43:51,536][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:43:51,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:43:51,544][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:43:51,546][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:43:51,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:43:51,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:52,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:53,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:54,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:55,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:55,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:56,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:57,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:58,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:58,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:43:59,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:00,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:01,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:02,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:02,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:03,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:44:04,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:05,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:06,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:07,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:08,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:08,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:09,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:10,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:11,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:12,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:12,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:13,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:14,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:15,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:16,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:16,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:18,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:44:19,626][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:44:19,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:44:21,174][__main__][INFO] - Iteration 291 took 55s (37.36% Gen, 62.63% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 45m 29s. Estimated total time: 15h 19m 19s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 39s. [2025-08-20 12:44:21,175][__main__][INFO] - Starting iteration 291. 
[2025-08-20 12:44:44,298][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:44:44,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:44:44,306][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:44:46,765][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:44:46,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:44:46,773][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:44:46,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:44:46,776][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:44:47,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:47,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:48,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:49,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:50,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:51,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:51,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:52,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:53,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:54,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:55,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:55,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:56,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:57,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:58,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:44:58,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:44:59,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:00,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:01,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:02,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:02,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:03,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:05,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:05,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:06,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:07,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:08,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:09,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:09,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:10,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:11,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:12,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:13,810][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:45:14,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:45:14,891][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:45:16,212][__main__][INFO] - Iteration 292 took 55s (37.55% Gen, 62.45% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 42m 31s. Estimated total time: 15h 17m 16s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 38s. [2025-08-20 12:45:16,214][__main__][INFO] - Starting iteration 292. 
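Annotation: "Sharing advantage alignment data" followed by "Receiving advantage packets" indicates that the two agents' trainers exchange their advantage estimates so each side can build an opponent-aware weighting for its own policy-gradient loss. The sketch below shows one plausible exchange over torch.distributed with a two-rank process group (one rank per agent); the packet layout, the exchange mechanism, and the aligned-advantage formula are all assumptions for illustration, since the log does not show the actual surrogate used by trainer_ad_align.

import torch
import torch.distributed as dist

def exchange_advantage_packets(own_adv):
    # Hypothetical two-trainer exchange (ranks 0 and 1): post our per-step
    # advantages, then block on the opponent's packet.
    peer = 1 - dist.get_rank()
    opp_adv = torch.empty_like(own_adv)
    send_req = dist.isend(own_adv, dst=peer)   # "Sharing advantage alignment data."
    dist.recv(opp_adv, src=peer)               # "Receiving advantage packets."
    send_req.wait()
    return opp_adv

def aligned_advantage(own_adv, opp_adv, beta=1.0):
    # One plausible form of the alignment term: past own advantages weighted
    # by the opponent's current advantage, added to the usual advantage.
    past_own = torch.cumsum(own_adv, dim=0)
    return own_adv + beta * past_own * opp_adv

Because isend is non-blocking, both ranks can post their send before blocking on recv, so the symmetric exchange does not deadlock.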
[2025-08-20 12:45:39,673][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:45:39,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:45:39,681][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:45:42,124][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:45:42,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:45:42,132][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:45:42,134][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:45:42,135][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:45:42,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:43,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:44,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:44,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:45,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:46,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:47,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:47,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:48,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:49,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:50,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:51,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:51,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:52,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:53,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:54,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:45:55,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:55,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:56,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:57,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:58,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:45:59,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:00,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:01,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:01,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:02,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:03,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:04,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:05,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:05,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:06,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:07,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:09,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:46:10,032][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:46:10,034][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:46:11,366][__main__][INFO] - Iteration 293 took 55s (38.09% Gen, 61.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 43m 31s. Estimated total time: 15h 19m 11s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 35s. [2025-08-20 12:46:11,368][__main__][INFO] - Starting iteration 293. 
[2025-08-20 12:46:34,421][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:46:34,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:46:34,428][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:46:36,874][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:46:36,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:46:36,881][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:46:36,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:46:36,884][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:46:37,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:37,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:38,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:39,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:40,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:41,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:41,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:42,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:43,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:44,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:45,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:45,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:46,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:47,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:48,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:49,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:46:49,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:50,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:51,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:52,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:53,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:54,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:55,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:55,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:56,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:57,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:58,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:59,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:46:59,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:00,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:01,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:02,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:03,807][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:47:04,742][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:47:04,744][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:47:06,063][__main__][INFO] - Iteration 294 took 54s (37.70% Gen, 62.30% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 35m 0s. Estimated total time: 15h 11m 35s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 9s, 500 more iterations: 7h 35m 47s. [2025-08-20 12:47:06,065][__main__][INFO] - Starting iteration 294. 
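Annotation: "Apply reinforce step" consumes the 32 accumulated microbatch gradients in a single optimizer update. The near-zero ΔVRAM Allocated (≈2.3e-05 GB) on every iteration here fits a step that reuses optimizer state allocated on the first update. A minimal sketch; gradient clipping and its max-norm value are assumptions.

import torch

def apply_reinforce_step(policy, optimizer, max_grad_norm=1.0):
    # Hypothetical sketch of the update following the 32 accumulation passes.
    # After the first iteration the optimizer's moment buffers already exist,
    # which would explain the near-zero ΔVRAM logged for this task.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)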
[2025-08-20 12:47:29,261][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:47:29,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:47:29,269][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:47:31,712][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:47:31,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:47:31,720][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:47:31,722][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:47:31,723][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:47:32,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:32,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:33,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:34,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:35,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:35,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:36,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:37,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:38,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:39,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:39,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:40,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:41,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:42,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:43,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:43,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:47:44,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:45,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:46,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:47,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:47,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:48,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:49,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:50,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:51,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:52,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:53,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:54,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:54,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:55,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:56,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:57,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:47:58,832][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:47:59,985][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:47:59,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:48:01,304][__main__][INFO] - Iteration 295 took 55s (37.59% Gen, 62.40% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 43m 8s. Estimated total time: 15h 20m 38s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 3s, 500 more iterations: 7h 40m 19s. [2025-08-20 12:48:01,306][__main__][INFO] - Starting iteration 295. 
[2025-08-20 12:48:24,378][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:48:24,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:48:24,386][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:48:26,829][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:48:26,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:48:26,837][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:48:26,839][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:48:26,839][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:48:27,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:27,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:28,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:29,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:30,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:31,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:31,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:32,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:33,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:34,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:35,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:35,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:36,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:37,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:38,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:39,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:48:39,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:40,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:41,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:42,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:43,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:44,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:45,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:45,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:46,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:47,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:48,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:49,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:49,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:50,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:51,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:52,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:48:53,823][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:48:54,771][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:48:54,773][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:48:56,157][__main__][INFO] - Iteration 296 took 54s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 35m 45s. Estimated total time: 15h 14m 10s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 25s, 500 more iterations: 7h 37m 5s. [2025-08-20 12:48:56,158][__main__][INFO] - Starting iteration 296. 
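Annotation: after each step, both the policy ("main") and critic optimizer states are checkpointed to the agent_trainer directory, which makes the run resumable at per-iteration granularity. A minimal sketch with torch.save/torch.load; out_dir stands in for the scratch path in the log, and the load helper is an assumed resume path.

import os
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir):
    # Mirrors the two "Saved ... optimizer state" entries above.
    torch.save(policy_opt.state_dict(),
               os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(),
               os.path.join(out_dir, "critic_optimizer_state.pt"))

def load_optimizer_states(policy_opt, critic_opt, out_dir):
    # Assumed resume path: restore both states before continuing training.
    policy_opt.load_state_dict(
        torch.load(os.path.join(out_dir, "policy_optimizer_state.pt")))
    critic_opt.load_state_dict(
        torch.load(os.path.join(out_dir, "critic_optimizer_state.pt")))

Saving the optimizer state (not just the adapter weights) preserves Adam-style moment estimates across restarts, so a resumed run continues with the same effective learning dynamics.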
[2025-08-20 12:49:19,288][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:49:19,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:49:19,296][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:49:21,756][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:49:21,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:49:21,763][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 12:49:21,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 12:49:21,766][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 12:49:22,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:22,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:23,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:24,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:25,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:26,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:26,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:27,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:28,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:29,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:29,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:30,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:31,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:32,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:33,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:33,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 12:49:34,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:35,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:36,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:37,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:38,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:39,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:39,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:40,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:41,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:42,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:43,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:43,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:44,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:45,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:46,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:47,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 12:49:48,733][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 12:49:49,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 12:49:49,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 12:49:51,595][__main__][INFO] - Iteration 297 took 55s (37.29% Gen, 62.71% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 44m 36s. Estimated total time: 15h 23m 57s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 58s. [2025-08-20 12:49:51,597][__main__][INFO] - Starting iteration 297. 
[2025-08-20 12:50:15,093][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:50:15,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:50:15,101][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:50:17,563][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:50:17,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:50:17,571][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:50:17,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:50:17,573][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:50:17,878 … 12:50:42,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, identical entries collapsed)
[2025-08-20 12:50:44,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:50:45,561][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:50:45,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:50:46,863][__main__][INFO] - Iteration 298 took 55s (38.03% Gen, 61.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 40m 50s. Estimated total time: 15h 21m 5s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 32s.
[2025-08-20 12:50:46,865][__main__][INFO] - Starting iteration 298.
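Between "Receiving advantage packets" and "Apply reinforce step", every iteration accumulates the policy-gradient loss over 32 micro-batches of 8,960 tokens each (286,720 tokens per update) and only then takes a single optimizer step. A minimal sketch of that pattern, assuming a REINFORCE-style token-level objective; the tensor names, the loss scaling, and the gradient clipping are assumptions, not the mllm code:

    import torch

    def reinforce_update(policy, optimizer, micro_batches, total_tokens):
        """Accumulate token-level policy-gradient loss, then step once (sketch)."""
        optimizer.zero_grad(set_to_none=True)
        for mb in micro_batches:  # e.g. 32 micro-batches of 8960 tokens
            logp = policy.token_log_probs(mb["input_ids"])  # hypothetical helper
            # REINFORCE: minimize -(advantage * log-prob), normalized so the
            # accumulated gradient averages over all tokens in the update
            loss = -(mb["advantages"] * logp).sum() / total_tokens
            loss.backward()  # gradients add up across micro-batches
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
        optimizer.step()     # the logged "Apply reinforce step"

Micro-batching keeps peak activation memory flat, which is consistent with the near-zero ΔVRAM the reinforce step reports above.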
[2025-08-20 12:51:09,941][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:51:09,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:51:09,949][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:51:12,421][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:51:12,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:51:12,429][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:51:12,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:51:12,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:51:12,730 … 12:51:37,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, identical entries collapsed)
[2025-08-20 12:51:39,475][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:51:40,441][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:51:40,443][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:51:41,884][__main__][INFO] - Iteration 299 took 55s (37.48% Gen, 62.52% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 35m 47s. Estimated total time: 15h 16m 58s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 29s.
[2025-08-20 12:51:41,885][__main__][INFO] - Starting iteration 299.
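The two "Saved … optimizer state" entries persist the policy and critic optimizer states after every update, so a preempted run can resume without losing its optimizer moments. A sketch of the save/restore pair, assuming the states are plain state_dicts serialized with torch.save; only the file names are taken from the log:

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
        torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
        torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

    def load_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
        policy_opt.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
        critic_opt.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))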
[2025-08-20 12:52:05,039][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:52:05,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:52:05,047][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:52:07,502][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:52:07,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:52:07,511][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:52:07,513][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:52:07,513][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:52:07,812 … 12:52:32,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, identical entries collapsed)
[2025-08-20 12:52:34,496][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:52:35,594][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:52:35,596][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:52:37,050][__main__][INFO] - Iteration 300 took 55s (37.55% Gen, 62.45% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 37m 18s. Estimated total time: 15h 19m 24s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 56s, 500 more iterations: 7h 39m 42s.
[2025-08-20 12:52:37,052][__main__][INFO] - Starting iteration 300.
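The per-iteration summary extrapolates linearly from the mean iteration duration; at roughly 55 s per iteration, the "10 / 100 / 500 more iterations" figures are plain multiplication. A sketch of the conversion (the function name is illustrative):

    def eta(seconds_per_iter: float, iters_left: int) -> str:
        s = round(seconds_per_iter * iters_left)
        h, rem = divmod(s, 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s"

    # With the ~55.2 s/iter mean implied by iteration 300's summary:
    # eta(55.2, 10)  -> '0h 9m 12s'   (log: 9m 11s)
    # eta(55.2, 500) -> '7h 40m 0s'   (log: 7h 39m 42s)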
[2025-08-20 12:53:00,102][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:53:00,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:53:00,110][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:53:02,569][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:53:02,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:53:02,577][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:53:02,579][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:53:02,580][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:53:02,880 … 12:53:27,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, identical entries collapsed)
[2025-08-20 12:53:29,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:53:30,581][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:53:30,583][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:53:34,419][__main__][INFO] - Iteration 301 took 57s (35.89% Gen, 59.70% Train). Generation: 20s, Training: 34s. Estimated remaining time: 11h 13m 3s. Estimated total time: 15h 56m 6s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 36s, 500 more iterations: 7h 58m 3s.
[2025-08-20 12:53:34,420][__main__][INFO] - Starting iteration 301.
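The "Sharing advantage alignment data" / "Receiving advantage packets" pair is where the two players' trainers swap advantage estimates, so that each policy update can be weighted by the opponent's advantages as well as its own (the opponent-shaping idea behind the run's ad_align name). The transport is not visible in the log; the queue-based sketch below is purely illustrative:

    import torch
    from queue import Queue

    def exchange_advantages(my_adv: torch.Tensor, outbox: Queue, inbox: Queue) -> torch.Tensor:
        """Share our advantage series, then block for the opponent's (sketch)."""
        outbox.put(my_adv.detach().cpu())  # "Sharing advantage alignment data."
        return inbox.get()                 # "Receiving advantage packets."

    # One published advantage-alignment weighting combines the two series as
    # w_t proportional to (sum of own past advantages) * (opponent's advantage
    # at t); the exact weighting used in this run is not recoverable from the log.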
[2025-08-20 12:53:57,562][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:53:57,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:53:57,570][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:00,008][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:00,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:00,015][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:00,017][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:54:00,018][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:54:00,316 … 12:54:25,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, identical entries collapsed)
[2025-08-20 12:54:27,026][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:54:27,995][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:54:27,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:54:29,380][__main__][INFO] - Iteration 302 took 54s (37.66% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 32m 1s. Estimated total time: 15h 15m 59s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 59s.
[2025-08-20 12:54:29,383][__main__][INFO] - Starting iteration 302.
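The oddly precise ΔVRAM of 2.288818359375e-05 GB reported for each reinforce step is exactly 24 KiB, which shows the logger's "GB" is really GiB (2^30 bytes) and suggests each update retains only one small 24 KiB allocation:

    >>> 2.288818359375e-05 * 2**30  # logged value, converted to bytes
    24576.0
    >>> 24576 / 1024                # = 24 KiB per update
    24.0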
[2025-08-20 12:54:52,943][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:52,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:52,951][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:55,417][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:55,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:55,425][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:54:55,427][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:54:55,428][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:54:55,727 … 12:55:20,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, identical entries collapsed)
[2025-08-20 12:55:22,415][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:55:23,491][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:55:23,494][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:55:24,929][__main__][INFO] - Iteration 303 took 55s (37.98% Gen, 62.02% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 40m 47s. Estimated total time: 15h 25m 41s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 34s, 500 more iterations: 7h 42m 50s.
[2025-08-20 12:55:24,932][__main__][INFO] - Starting iteration 303.
[2025-08-20 12:55:48,059][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:55:48,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:55:48,067][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:55:50,535][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:55:50,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:55:50,542][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:55:50,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:55:50,545][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:55:50,857 … 12:56:15,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, identical entries collapsed)
[2025-08-20 12:56:17,571][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:56:18,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:56:18,527][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:56:19,952][__main__][INFO] - Iteration 304 took 55s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 31m 11s. Estimated total time: 15h 17m 0s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 42s, 500 more iterations: 7h 38m 30s.
[2025-08-20 12:56:19,954][__main__][INFO] - Starting iteration 304.
[2025-08-20 12:56:43,125][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:56:43,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:56:43,132][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:56:45,573][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:56:45,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:56:45,581][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:56:45,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:56:45,584][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:56:45,887 … 12:57:10,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, identical entries collapsed)
[2025-08-20 12:57:12,542][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:57:13,649][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:57:13,651][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:57:15,214][__main__][INFO] - Iteration 305 took 55s (37.53% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 34m 15s. Estimated total time: 15h 20m 59s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 5s, 500 more iterations: 7h 40m 29s.
[2025-08-20 12:57:15,216][__main__][INFO] - Starting iteration 305.
[2025-08-20 12:57:38,746][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:57:38,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:57:38,754][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:57:41,206][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:57:41,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:57:41,214][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 12:57:41,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:57:41,217][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:57:41,515 … 12:58:06,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, identical entries collapsed)
[2025-08-20 12:58:08,134][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 12:58:09,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:58:09,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:58:10,446][__main__][INFO] - Iteration 306 took 55s (38.16% Gen, 61.84% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 32m 50s. Estimated total time: 15h 20m 29s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 14s.
[2025-08-20 12:58:10,447][__main__][INFO] - Starting iteration 306.
[2025-08-20 12:58:34,568][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:58:34,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:58:34,575][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:58:37,035][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:58:37,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:58:37,042][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:58:37,045][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:58:37,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:58:37,343 – 12:59:02,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries)
[2025-08-20 12:59:04,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 12:59:05,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 12:59:05,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 12:59:06,669][__main__][INFO] - Iteration 307 took 56s (38.57% Gen, 61.42% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 48m 25s. Estimated total time: 15h 37m 1s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 42s, 500 more iterations: 7h 48m 30s.
[2025-08-20 12:59:06,671][__main__][INFO] - Starting iteration 307.
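Every optimizer update is preceded by 32 consecutive "Accumulated the policy gradient loss for 8960 tokens" records, i.e. gradients for 32 × 8,960 = 286,720 tokens are accumulated before a single "Apply reinforce step". A sketch of such a loop under assumed batch fields (input_ids, labels, advantages, action_mask) and an HF-style policy model; the actual trainer_common internals are not shown in this log:

import torch

def accumulated_reinforce_step(policy, optimizer, microbatches, logger):
    # Accumulate the policy-gradient loss over fixed-size token microbatches,
    # then apply one optimizer update ("Apply reinforce step").
    optimizer.zero_grad(set_to_none=True)
    for batch in microbatches:  # e.g. 32 microbatches of 8,960 tokens each
        logits = policy(batch["input_ids"]).logits
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, batch["labels"].unsqueeze(-1)).squeeze(-1)
        # REINFORCE: advantage-weighted negative log-likelihood over the
        # generated (action) tokens; averaging over microbatches keeps the
        # accumulated update equivalent to one large batch.
        loss = -(batch["advantages"] * token_logp * batch["action_mask"]).sum()
        (loss / len(microbatches)).backward()
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    int(batch["action_mask"].sum()))
    optimizer.step()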
[2025-08-20 12:59:30,577][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:59:30,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:59:30,585][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:59:33,064][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:59:33,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:59:33,071][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 12:59:33,074][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 12:59:33,074][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 12:59:33,380 – 12:59:58,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries)
[2025-08-20 12:59:59,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:00:00,937][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:00:00,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:00:02,865][__main__][INFO] - Iteration 308 took 56s (38.14% Gen, 61.86% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 47m 1s. Estimated total time: 15h 36m 33s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 16s.
[2025-08-20 13:00:02,866][__main__][INFO] - Starting iteration 308.
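The iteration summary is plain extrapolation from the iteration that just finished, which explains both the roughly 20-minute swings in the remaining-time estimate between 54 s and 56 s iterations and the fractional duration implied by the decade estimates: iteration 308 prints "took 56s" yet "10 more iterations: 9m 21s", i.e. 56.1 s per iteration, and 10h 47m 1s ≈ 38,821 s ≈ 692 such iterations, consistent with a total budget of about 1,000 iterations (the configured count is not visible in this excerpt). A sketch of the arithmetic, with hypothetical names:

def fmt_hms(seconds):
    # Render float seconds as "Xh Ym Zs", matching the summary lines.
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    return f"{h}h {m}m {sec}s" if h else f"{m}m {sec}s"

def iteration_summary(i, total_iters, gen_s, train_s):
    # gen_s and train_s are fractional; the printed "took 56s" truncates.
    iter_s = gen_s + train_s
    gen_pct = 100.0 * gen_s / iter_s
    remaining = iter_s * (total_iters - i - 1)
    return (f"Iteration {i} took {int(iter_s)}s "
            f"({gen_pct:.2f}% Gen, {100.0 - gen_pct:.2f}% Train). "
            f"Estimated remaining time: {fmt_hms(remaining)}. "
            f"Time estimates for 10 more iterations: {fmt_hms(10 * iter_s)}, "
            f"100 more iterations: {fmt_hms(100 * iter_s)}, "
            f"500 more iterations: {fmt_hms(500 * iter_s)}.")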
[2025-08-20 13:00:25,999][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:00:26,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:00:26,007][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:00:28,475][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:00:28,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:00:28,483][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:00:28,485][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:00:28,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:00:28,784 – 13:00:53,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries)
[2025-08-20 13:00:55,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:00:56,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:00:56,396][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:00:57,729][__main__][INFO] - Iteration 309 took 54s (37.69% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 23m 55s. Estimated total time: 15h 14m 22s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 26s, 500 more iterations: 7h 37m 11s.
[2025-08-20 13:00:57,730][__main__][INFO] - Starting iteration 309.
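The paired "Sharing advantage alignment data" / "Receiving advantage packets" records indicate two co-trained agents exchanging advantage estimates so each policy update can condition on its opponent's advantages, in the spirit of advantage-alignment opponent shaping. The exact functional form this trainer uses is not recoverable from the log; the following is a deliberately simplified sketch of one plausible weighting:

import torch

def aligned_pg_weights(adv_self, adv_other, gamma=0.96, beta=1.0):
    # adv_other is what the received "advantage packets" would carry.
    # Discounted running sum of the opponent's advantages up to each step t.
    aligned = torch.zeros_like(adv_other)
    running = torch.zeros_like(adv_other[..., 0])
    for t in range(adv_other.shape[-1]):
        running = gamma * running + adv_other[..., t]
        aligned[..., t] = running
    # Combine the agent's own advantage with an alignment term that favors
    # actions whose advantages agree in sign across the two players.
    return adv_self + beta * adv_self * aligned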
[2025-08-20 13:01:20,828][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:01:20,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:01:20,836][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:01:23,291][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:01:23,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:01:23,298][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:01:23,301][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:01:23,301][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:01:23,599 – 13:01:48,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries)
[2025-08-20 13:01:50,279][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:01:51,286][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:01:51,288][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:01:52,632][__main__][INFO] - Iteration 310 took 54s (37.61% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 23m 39s. Estimated total time: 15h 15m 1s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 30s.
[2025-08-20 13:01:52,633][__main__][INFO] - Starting iteration 310.
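Both optimizer states are re-saved at the end of every iteration, so a preempted job can resume without losing optimizer moments. A minimal sketch of that step (the paths are the ones from the log; the function name is hypothetical):

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    # Persist only the optimizer state_dicts; model or adapter weights would
    # be checkpointed separately.
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)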
[2025-08-20 13:02:16,861][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:02:16,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:02:16,869][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:02:19,327][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:02:19,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:02:19,335][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:02:19,337][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:02:19,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:02:19,636 – 13:02:44,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries)
[2025-08-20 13:02:46,251][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:02:47,541][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:02:47,544][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:02:48,993][__main__][INFO] - Iteration 311 took 56s (38.63% Gen, 61.37% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 47m 0s. Estimated total time: 15h 39m 18s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 55s, 500 more iterations: 7h 49m 39s.
[2025-08-20 13:02:48,994][__main__][INFO] - Starting iteration 311.
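The record layout itself, [timestamp][logger name][level] - message with millisecond-suffixed timestamps, is what Python's logging module produces with a format string like the one below; it resembles the stock job-logging layout configured by Hydra:

import logging

# Reproduces this file's record format, e.g.
# [2025-08-20 13:02:48,994][__main__][INFO] - Starting iteration 311.
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s][%(name)s][%(levelname)s] - %(message)s",
    # Leaving datefmt unset keeps logging's default
    # "YYYY-MM-DD HH:MM:SS,mmm" timestamp.
)

logging.getLogger("__main__").info("Starting iteration 311.")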
[2025-08-20 13:03:12,571][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:03:12,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:03:12,579][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:03:15,021][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:03:15,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:03:15,029][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:03:15,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:03:15,031][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:03:15,332 – 13:03:40,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries)
[2025-08-20 13:03:42,007][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:03:43,103][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:03:43,106][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:03:44,447][__main__][INFO] - Iteration 312 took 55s (38.11% Gen, 61.88% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 30m 59s. Estimated total time: 15h 24m 12s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 25s, 500 more iterations: 7h 42m 6s.
[2025-08-20 13:03:44,449][__main__][INFO] - Starting iteration 312.
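"Get advantages with critic gradient accumulation" reads as computing advantage estimates while accumulating the critic's value-regression gradient in the same pass. A sketch under that reading with a GAE-style estimator; the estimator, discount factors, and batch layout are all assumptions:

import torch

def advantages_with_critic_accumulation(critic, obs, rewards, gamma=0.99, lam=0.95):
    values = critic(obs).squeeze(-1)  # (T,), keeps grad for the critic loss
    with torch.no_grad():
        # GAE: discounted sum of TD residuals, computed without grad.
        next_values = torch.cat([values[1:], values.new_zeros(1)])
        deltas = rewards + gamma * next_values - values
        adv = torch.zeros_like(deltas)
        running = deltas.new_zeros(())
        for t in reversed(range(deltas.shape[0])):
            running = deltas[t] + gamma * lam * running
            adv[t] = running
        returns = adv + values
    # Accumulate the critic gradient now; its optimizer steps later.
    value_loss = 0.5 * (values - returns).pow(2).mean()
    value_loss.backward()
    return adv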
[2025-08-20 13:04:07,892][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:04:07,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:04:07,899][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:04:10,334][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:04:10,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:04:10,342][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:04:10,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:04:10,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:04:10,643 – 13:04:35,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries)
[2025-08-20 13:04:37,268][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:04:38,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:04:38,303][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:04:39,832][__main__][INFO] - Iteration 313 took 55s (37.94% Gen, 62.06% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 28m 54s. Estimated total time: 15h 23m 3s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 31s.
[2025-08-20 13:04:39,834][__main__][INFO] - Starting iteration 313.
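The alternative-trajectory sequence runs twice per iteration, and each of its tasks reports ΔTime 00:00:00 with zero ΔVRAM, which is consistent with cheap tensor manipulation over cached rollouts rather than fresh model calls. A speculative sketch of the batch construction; every name here is an assumption:

import torch

def make_alternative_batch(batch, alt_actions):
    # Copy the realized trajectory batch and splice in counterfactual actions;
    # advantages are then re-estimated for the spliced batch under the same
    # critic ("Compute alternative advantage estimates").
    alt = {k: (v.clone() if torch.is_tensor(v) else v) for k, v in batch.items()}
    alt["actions"] = alt_actions
    return alt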
[2025-08-20 13:05:02,980][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:05:02,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:05:02,988][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:05:05,405][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:05:05,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:05:05,413][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:05:05,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:05:05,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:05:05,713 – 13:05:30,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries)
[2025-08-20 13:05:32,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:05:33,244][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:05:33,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:05:35,351][__main__][INFO] - Iteration 314 took 55s (37.28% Gen, 62.72% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 30m 12s. Estimated total time: 15h 25m 16s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 38s.
[2025-08-20 13:05:35,352][__main__][INFO] - Starting iteration 314.
[2025-08-20 13:05:59,042][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:05:59,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:05:59,050][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:06:01,515][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:06:01,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:06:01,523][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:06:01,525][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:06:01,526][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:06:01,826 – 13:06:26,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries)
[2025-08-20 13:06:28,485][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:06:29,604][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:06:29,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:06:31,124][__main__][INFO] - Iteration 315 took 55s (38.07% Gen, 61.93% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 33m 30s. Estimated total time: 15h 29m 30s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 45s.
[2025-08-20 13:06:31,125][__main__][INFO] - Starting iteration 315.
[2025-08-20 13:06:55,298][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:06:55,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:06:55,306][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:06:57,765][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:06:57,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:06:57,773][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:06:57,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:06:57,776][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:06:58,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:06:58,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:06:59,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:00,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:01,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:02,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:02,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:03,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:04,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:05,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:05,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:06,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:07,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:08,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:09,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:09,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:10,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:11,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:12,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:13,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:13,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:15,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:15,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:16,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:17,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:18,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:19,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:19,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:20,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:21,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:22,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:23,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:24,676][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:07:25,574][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:07:25,575][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:07:27,100][__main__][INFO] - Iteration 316 took 55s (38.81% Gen, 61.19% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 35m 58s. Estimated total time: 15h 32m 54s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 27s.
[2025-08-20 13:07:27,102][__main__][INFO] - Starting iteration 316.
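Each iteration above accumulates 32 microbatches of 8,960 tokens each, i.e. roughly 32 × 8,960 = 286,720 tokens, before the single "Apply reinforce step" update. A minimal sketch of that accumulate-then-step pattern follows, with hypothetical names (policy, microbatches) and a generic REINFORCE-style surrogate standing in for the trainer's actual loss:

import logging

logger = logging.getLogger("mllm.training.trainer_common")

def reinforce_step(policy, optimizer, microbatches):
    """Accumulate policy-gradient loss over microbatches, then step once."""
    optimizer.zero_grad(set_to_none=True)
    for batch in microbatches:  # 32 microbatches per update in this run
        # Per-token log-probs of the sampled actions (shape is an assumption).
        logprobs = policy(batch["input_ids"])
        loss = -(batch["advantages"] * logprobs).sum() / batch["num_tokens"]
        loss.backward()  # gradients accumulate across microbatches
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    batch["num_tokens"])
    optimizer.step()  # the single "Apply reinforce step"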
[2025-08-20 13:07:50,940][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:07:50,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:07:50,947][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:07:53,402][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:07:53,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:07:53,409][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:07:53,412][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:07:53,412][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:07:53,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:54,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:55,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:56,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:56,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:57,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:58,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:07:59,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:00,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:00,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:01,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:02,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:03,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:04,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:04,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:05,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:06,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:07,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:07,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:08,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:09,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:10,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:11,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:12,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:13,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:13,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:14,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:15,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:16,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:17,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:17,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:18,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:20,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:08:21,234][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:08:21,235][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:08:23,632][__main__][INFO] - Iteration 317 took 56s (37.82% Gen, 62.18% Train). Generation: 21s, Training: 35s. Estimated remaining time: 10h 44m 17s. Estimated total time: 15h 42m 10s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 13s, 500 more iterations: 7h 51m 5s.
[2025-08-20 13:08:23,634][__main__][INFO] - Starting iteration 317.
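The two "Saved … optimizer state" records show the policy and critic optimizer states being serialized to the agent_trainer directory after every step, which is what lets a preempted run resume without resetting optimizer statistics. A sketch of that checkpointing, assuming standard torch.save/torch.load of state_dict() (the helper names are hypothetical):

import os

import torch

def save_optimizer_states(save_dir, policy_optimizer, critic_optimizer):
    torch.save(policy_optimizer.state_dict(),
               os.path.join(save_dir, "policy_optimizer_state.pt"))
    torch.save(critic_optimizer.state_dict(),
               os.path.join(save_dir, "critic_optimizer_state.pt"))

def load_optimizer_states(save_dir, policy_optimizer, critic_optimizer):
    for optimizer, name in ((policy_optimizer, "policy_optimizer_state.pt"),
                            (critic_optimizer, "critic_optimizer_state.pt")):
        path = os.path.join(save_dir, name)
        if os.path.exists(path):
            optimizer.load_state_dict(torch.load(path))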
[2025-08-20 13:08:46,814][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:08:46,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:08:46,822][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:08:49,313][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:08:49,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:08:49,321][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:08:49,323][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:08:49,324][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:08:49,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:50,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:51,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:52,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:52,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:53,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:54,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:55,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:55,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:56,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:57,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:58,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:59,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:08:59,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:00,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:01,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:02,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:03,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:04,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:05,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:05,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:06,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:07,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:08,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:09,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:09,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:10,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:11,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:12,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:13,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:13,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:14,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:16,339][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:09:17,324][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:09:17,326][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:09:19,133][__main__][INFO] - Iteration 318 took 55s (37.27% Gen, 62.73% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 26m 10s. Estimated total time: 15h 24m 58s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 29s, 500 more iterations: 7h 42m 29s.
[2025-08-20 13:09:19,135][__main__][INFO] - Starting iteration 318.
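The projections in the summary lines are internally consistent with an average of about 55.5 s per iteration at this point: 9m 14s ≈ 554 s for 10 iterations, 1h 32m 29s ≈ 5,549 s for 100, and 7h 42m 29s ≈ 27,749 s for 500, while the 10h 26m 10s remaining implies roughly 677 iterations left. A small sketch of that projection arithmetic; whether the trainer uses a plain running mean over all completed iterations is an assumption:

def format_hms(seconds: float) -> str:
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

def project_times(durations, remaining_iters):
    """Extrapolate wall-clock time from per-iteration durations (seconds)."""
    avg = sum(durations) / len(durations)  # assumed: running mean
    projections = {n: format_hms(n * avg) for n in (10, 100, 500)}
    projections["remaining"] = format_hms(remaining_iters * avg)
    return projections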
[2025-08-20 13:09:44,307][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:09:44,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:09:44,315][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:09:46,764][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:09:46,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:09:46,772][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:09:46,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:09:46,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:09:47,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:47,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:48,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:49,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:50,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:51,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:51,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:52,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:53,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:54,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:55,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:55,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:56,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:57,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:58,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:58,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:09:59,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:00,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:01,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:02,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:02,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:03,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:04,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:05,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:06,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:07,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:08,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:08,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:09,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:10,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:11,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:12,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:13,775][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:10:14,730][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:10:14,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:10:16,307][__main__][INFO] - Iteration 319 took 57s (39.77% Gen, 60.23% Train). Generation: 22s, Training: 34s. Estimated remaining time: 10h 53m 6s. Estimated total time: 15h 52m 51s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 17s, 500 more iterations: 7h 56m 25s.
[2025-08-20 13:10:16,308][__main__][INFO] - Starting iteration 319.
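The log does not show how "Get advantages with critic gradient accumulation" forms its estimates; a standard estimator compatible with a learned critic is GAE(λ). The sketch below is that generic textbook computation, not necessarily the estimator trainer_ad_align actually uses:

import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: (T,) per-step rewards; values: (T+1,) critic values,
    including a bootstrap value for the final state.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages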
[2025-08-20 13:10:39,379][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:10:39,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:10:39,387][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:10:41,851][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:10:41,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:10:41,859][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:10:41,861][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:10:41,862][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:10:42,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:42,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:43,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:44,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:45,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:46,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:46,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:47,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:48,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:49,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:50,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:50,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:51,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:52,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:53,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:54,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:54,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:55,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:56,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:57,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:58,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:10:59,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:00,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:00,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:01,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:02,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:03,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:04,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:04,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:05,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:06,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:07,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:08,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:11:09,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:11:09,741][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:11:11,263][__main__][INFO] - Iteration 320 took 54s (37.51% Gen, 62.49% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 15m 14s. Estimated total time: 15h 15m 54s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 57s.
[2025-08-20 13:11:11,264][__main__][INFO] - Starting iteration 320.
[2025-08-20 13:11:34,458][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:11:34,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:11:34,465][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:11:36,939][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:11:36,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:11:36,946][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:11:36,949][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:11:36,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:11:37,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:38,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:38,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:39,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:40,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:41,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:42,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:42,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:43,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:44,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:45,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:45,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:46,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:47,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:48,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:49,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:49,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:50,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:51,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:52,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:53,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:53,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:54,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:55,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:56,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:57,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:58,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:11:59,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:00,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:00,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:01,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:02,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:03,969][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:12:04,957][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:12:04,959][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:12:06,340][__main__][INFO] - Iteration 321 took 55s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 16m 19s. Estimated total time: 15h 17m 55s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 57s.
[2025-08-20 13:12:06,342][__main__][INFO] - Starting iteration 321.
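"Sharing advantage alignment data" followed immediately by "Receiving advantage packets" suggests each agent's trainer publishes its own advantage estimates and then blocks until its opponent's arrive, so both sides can form the alignment term of the objective. A hypothetical sketch of that handshake; the queue transport and packet layout are assumptions, not mllm's actual API:

import torch
from torch.multiprocessing import Queue

def share_advantages(outbox: Queue, iteration: int, advantages: torch.Tensor):
    """Publish this trainer's advantage estimates for the opponent."""
    outbox.put({"iteration": iteration, "advantages": advantages.cpu()})

def receive_advantages(inbox: Queue, iteration: int) -> torch.Tensor:
    """Block until the opponent's advantage packet for this iteration arrives."""
    packet = inbox.get()
    assert packet["iteration"] == iteration, "trainers out of sync"
    return packet["advantages"]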
[2025-08-20 13:12:29,451][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:12:29,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:12:29,459][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:12:31,924][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:12:31,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:12:31,931][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:12:31,934][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:12:31,934][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:12:32,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:33,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:33,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:34,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:35,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:36,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:36,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:37,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:38,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:39,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:40,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:40,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:41,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:42,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:43,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:44,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:44,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:46,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:46,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:47,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:48,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:49,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:50,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:50,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:51,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:52,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:53,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:54,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:54,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:55,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:56,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:57,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:12:58,898][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:13:00,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:13:00,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:13:01,570][__main__][INFO] - Iteration 322 took 55s (37.36% Gen, 62.64% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 17m 57s. Estimated total time: 15h 20m 28s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 14s.
[2025-08-20 13:13:01,572][__main__][INFO] - Starting iteration 322.
[2025-08-20 13:13:24,966][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:13:24,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:13:24,974][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:13:27,427][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:13:27,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:13:27,435][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:13:27,437][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:13:27,438][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:13:27,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:28,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:29,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:30,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:30,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:31,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:32,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:33,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:34,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:34,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:35,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:36,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:37,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:38,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:38,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:39,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:40,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:41,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:42,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:43,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:44,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:44,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:45,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:46,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:47,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:48,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:48,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:49,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:50,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:51,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:52,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:52,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:13:54,391][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:13:55,331][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:13:55,333][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:13:56,687][__main__][INFO] - Iteration 323 took 55s (37.97% Gen, 62.03% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 15m 9s. Estimated total time: 15h 18m 35s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 51s, 500 more iterations: 7h 39m 17s.
[2025-08-20 13:13:56,689][__main__][INFO] - Starting iteration 323.
[2025-08-20 13:14:19,895][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:14:19,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:14:19,903][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:14:22,376][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:14:22,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:14:22,384][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:14:22,386][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:14:22,387][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:14:22,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:23,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:24,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:25,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:25,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:26,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:27,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:28,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:29,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:29,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:30,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:31,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:32,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:33,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:33,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:34,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:35,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:36,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:36,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:37,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:38,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:39,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:40,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:41,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:42,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:43,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:43,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:44,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:45,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:46,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:47,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:47,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:14:49,414][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:14:50,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:14:50,466][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:14:51,851][__main__][INFO] - Iteration 324 took 55s (37.56% Gen, 62.44% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 15m 0s. Estimated total time: 15h 19m 21s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 56s, 500 more iterations: 7h 39m 40s.
[2025-08-20 13:14:51,852][__main__][INFO] - Starting iteration 324.
[2025-08-20 13:15:14,897][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:15:14,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:15:14,905][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:15:17,359][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:15:17,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:15:17,367][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:15:17,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:15:17,370][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:15:17,668 … 13:15:42,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries; first and last timestamps shown)
[2025-08-20 13:15:44,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:15:45,293][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:15:45,294][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:15:46,789][__main__][INFO] - Iteration 325 took 54s (37.46% Gen, 62.53% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 10m 20s. Estimated total time: 15h 15m 36s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 48s.
[2025-08-20 13:15:46,791][__main__][INFO] - Starting iteration 325.
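Note: each "Accumulated the policy gradient loss for 8960 tokens." entry corresponds to one micro-batch whose gradients are accumulated before the single "Apply reinforce step" optimizer update (32 micro-batches per iteration in this run). A minimal sketch of that pattern, assuming an HF-style causal LM and a REINFORCE objective; the tensor names and the exact loss form are illustrative, not the mllm implementation.

import torch

def accumulate_and_step(policy, micro_batches, optimizer, logger):
    # micro_batches: list of dicts with "input_ids" (B, T), "action_mask" (B, T)
    # marking agent-generated tokens, and per-token "advantages" (B, T).
    optimizer.zero_grad(set_to_none=True)
    for mb in micro_batches:                       # 32 micro-batches per step here
        logits = policy(mb["input_ids"]).logits    # HF-style causal LM assumed
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, mb["input_ids"].unsqueeze(-1)).squeeze(-1)
        # (the usual next-token shift is omitted for brevity)
        n_tokens = int(mb["action_mask"].sum())    # 8960 in this run
        # REINFORCE: advantage-weighted negative log-likelihood of the actions.
        loss = -(token_logp * mb["advantages"] * mb["action_mask"]).sum() / n_tokens
        (loss / len(micro_batches)).backward()     # accumulate gradients
        logger.info("Accumulated the policy gradient loss for %d tokens.", n_tokens)
    optimizer.step()                               # the "Apply reinforce step"

The near-zero ΔVRAM of the reinforce step is consistent with this layout: the memory-hungry backward passes happen during accumulation, so the optimizer update itself allocates almost nothing new.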
[2025-08-20 13:16:10,419][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:16:10,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:16:10,427][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:16:12,867][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:16:12,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:16:12,875][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:16:12,877][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:16:12,877][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:16:13,175 … 13:16:38,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries; first and last timestamps shown)
[2025-08-20 13:16:39,797][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:16:40,747][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:16:40,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:16:42,147][__main__][INFO] - Iteration 326 took 55s (38.27% Gen, 61.72% Train). Generation: 21s, Training: 34s. Estimated remaining time: 10h 16m 24s. Estimated total time: 15h 22m 35s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 17s.
[2025-08-20 13:16:42,148][__main__][INFO] - Starting iteration 326.
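Note: the paired "Sharing advantage alignment data." / "Receiving advantage packets." entries suggest that the two agents' trainers exchange advantage estimates, which advantage alignment needs so each policy gradient can be weighted by the co-player's advantages. A minimal in-process sketch of such an exchange; the packet fields and the mailbox mechanism are assumptions, and the real trainers may communicate through a different channel.

from dataclasses import dataclass
import queue

import torch

@dataclass
class AdvantagePacket:
    agent_name: str
    advantages: torch.Tensor   # per-step advantage estimates for the shared game
    step_ids: torch.Tensor     # aligns each estimate with a trajectory step

class AdvantageMailbox:
    # One inbox per agent; both trainers hold a reference to the same mailbox.
    def __init__(self, agent_names):
        self._inboxes = {name: queue.Queue() for name in agent_names}

    def share(self, sender, packet):
        # "Sharing advantage alignment data." - deliver to every other agent.
        for name, inbox in self._inboxes.items():
            if name != sender:
                inbox.put(packet)

    def receive(self, receiver):
        # "Receiving advantage packets." - blocks until the co-player has shared.
        return self._inboxes[receiver].get()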
[2025-08-20 13:17:05,260][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:17:05,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:17:05,268][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:17:07,740][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:17:07,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:17:07,748][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:17:07,750][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:17:07,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:17:08,054 … 13:17:33,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries; first and last timestamps shown)
[2025-08-20 13:17:34,730][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:17:35,694][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:17:35,696][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:17:37,332][__main__][INFO] - Iteration 327 took 55s (37.41% Gen, 62.59% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 12m 37s. Estimated total time: 15h 19m 43s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 51s.
[2025-08-20 13:17:37,334][__main__][INFO] - Starting iteration 327.
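Note: the "Get advantages with critic gradient accumulation" and "Compute alternative advantage estimates" tasks suggest a critic-baseline advantage estimator run once on the sampled trajectories and once on the alternative (counterfactual) trajectory batch. A minimal sketch under that reading, using discounted returns minus a value baseline; the discounting scheme and the critic interface are assumptions (the actual estimator may be GAE or similar).

import torch

def advantage_estimates(rewards, values, gamma=0.99):
    # rewards, values: (T,) tensors for one trajectory.
    T = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    running = torch.zeros(())
    for t in reversed(range(T)):                  # discounted return-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    # Advantage = return-to-go minus the critic's value baseline. For the
    # "alternative" estimates, the same computation is applied to the
    # counterfactual trajectory batch.
    return returns - values.detach()

# The critic itself is fit to the returns, accumulating gradients over
# micro-batches before its own optimizer step:
# critic_loss = torch.nn.functional.mse_loss(values, returns)
# critic_loss.backward()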
[2025-08-20 13:18:02,272][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:02,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:02,282][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:04,723][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:04,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:04,730][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:04,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:18:04,733][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:18:05,076 … 13:18:30,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries; first and last timestamps shown)
[2025-08-20 13:18:31,832][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:18:32,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:18:32,868][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:18:34,209][__main__][INFO] - Iteration 328 took 56s (39.55% Gen, 60.45% Train). Generation: 22s, Training: 34s. Estimated remaining time: 10h 39m 51s. Estimated total time: 15h 47m 54s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 47s, 500 more iterations: 7h 53m 57s.
[2025-08-20 13:18:34,213][__main__][INFO] - Starting iteration 328.
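Note: the two "Saved … optimizer state" lines after every update correspond to straightforward state_dict checkpointing, which is what lets a killed run resume with optimizer momentum intact. A minimal sketch, assuming torch.save and a hypothetical save_trainer_state helper; the real trainer may checkpoint more state (schedulers, RNG) alongside.

import os

import torch

def save_trainer_state(policy_optimizer, critic_optimizer, out_dir, logger):
    os.makedirs(out_dir, exist_ok=True)
    policy_path = os.path.join(out_dir, "policy_optimizer_state.pt")
    critic_path = os.path.join(out_dir, "critic_optimizer_state.pt")
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)

# On resume: optimizer.load_state_dict(torch.load(path)) restores momentum
# and other per-parameter statistics alongside the model weights.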
[2025-08-20 13:18:57,346][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:57,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:57,353][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:59,795][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:59,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:59,803][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:18:59,805][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:18:59,805][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:19:00,111 … 13:19:25,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries; first and last timestamps shown)
[2025-08-20 13:19:26,721][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:19:27,717][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:19:27,718][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:19:29,101][__main__][INFO] - Iteration 329 took 54s (37.68% Gen, 62.32% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 5m 49s. Estimated total time: 15h 14m 47s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 23s.
[2025-08-20 13:19:29,102][__main__][INFO] - Starting iteration 329.
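Note: the per-iteration summary line is a running time-accounting report: the generation/training split for the iteration just finished, plus projections extrapolated from the average iteration duration, which is why the remaining-time estimates drift slightly from one iteration to the next. A minimal sketch of how such a report can be assembled; the function name and formatting are illustrative.

def eta_report(durations, gen_s, train_s, iteration, total_iters):
    # durations: wall times (s) of all finished iterations.
    avg = sum(durations) / len(durations)

    def fmt(seconds):
        return f"{int(seconds // 3600)}h {int(seconds % 3600 // 60)}m {int(seconds % 60)}s"

    took = durations[-1]
    gen_pct = 100.0 * gen_s / (gen_s + train_s)
    return (
        f"Iteration {iteration} took {round(took)}s "
        f"({gen_pct:.2f}% Gen, {100.0 - gen_pct:.2f}% Train). "
        f"Generation: {round(gen_s)}s, Training: {round(train_s)}s. "
        f"Estimated remaining time: {fmt(avg * (total_iters - iteration))}. "
        f"Estimated total time: {fmt(avg * total_iters)}. "
        f"Time estimates for 10 more iterations: {fmt(avg * 10)}, "
        f"100 more iterations: {fmt(avg * 100)}, "
        f"500 more iterations: {fmt(avg * 500)}."
    )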
[2025-08-20 13:19:52,214][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:19:52,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:19:52,222][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:19:54,667][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:19:54,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:19:54,675][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:19:54,677][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:19:54,677][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:19:54,982 … 13:20:20,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries; first and last timestamps shown)
[2025-08-20 13:20:21,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:20:22,614][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:20:22,615][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:20:23,982][__main__][INFO] - Iteration 330 took 54s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 4m 46s. Estimated total time: 15h 14m 39s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 19s.
[2025-08-20 13:20:23,984][__main__][INFO] - Starting iteration 330.
[2025-08-20 13:20:47,192][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:20:47,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:20:47,199][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:20:49,652][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:20:49,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:20:49,660][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:20:49,662][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:20:49,662][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:20:49,960 … 13:21:15,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries; first and last timestamps shown)
[2025-08-20 13:21:16,598][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:21:17,586][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:21:17,588][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:21:18,980][__main__][INFO] - Iteration 331 took 54s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 5m 47s. Estimated total time: 15h 16m 35s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 39s, 500 more iterations: 7h 38m 17s.
[2025-08-20 13:21:18,982][__main__][INFO] - Starting iteration 331.
[2025-08-20 13:21:42,051][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:21:42,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:21:42,059][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:21:44,522][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:21:44,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:21:44,530][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:21:44,532][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:21:44,533][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:21:44,831 … 13:22:09,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries; first and last timestamps shown)
[2025-08-20 13:22:11,421][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:22:12,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:22:12,386][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:22:13,797][__main__][INFO] - Iteration 332 took 54s (37.60% Gen, 62.39% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 1m 52s. Estimated total time: 15h 13m 35s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 21s, 500 more iterations: 7h 36m 47s.
[2025-08-20 13:22:13,799][__main__][INFO] - Starting iteration 332.
[2025-08-20 13:22:37,127][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:22:37,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:22:37,135][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:22:39,605][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:22:39,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:22:39,613][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:22:39,615][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:22:39,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:22:39,916 … 13:23:05,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical entries; first and last timestamps shown)
[2025-08-20 13:23:06,607][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:23:07,531][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:23:07,532][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:23:08,894][__main__][INFO] - Iteration 333 took 55s (37.87% Gen, 62.13% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 5m 36s. Estimated total time: 15h 18m 14s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 7s.
[2025-08-20 13:23:08,895][__main__][INFO] - Starting iteration 333.
[2025-08-20 13:23:32,055][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:23:32,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:23:32,062][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:23:34,505][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:23:34,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:23:34,513][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:23:34,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:23:34,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:23:34,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:35,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:36,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:37,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:37,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:38,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:39,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:40,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:41,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:41,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:42,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:43,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:44,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:45,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:45,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:46,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:23:47,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:48,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:49,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:49,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:50,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:51,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:52,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:53,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:54,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:55,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:55,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:56,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:57,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:58,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:59,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:23:59,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:01,520][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:24:02,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:24:02,793][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:24:04,198][__main__][INFO] - Iteration 334 took 55s (37.46% Gen, 62.54% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 8m 8s. Estimated total time: 15h 21m 42s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 10s, 500 more iterations: 7h 40m 51s. [2025-08-20 13:24:04,199][__main__][INFO] - Starting iteration 334. 
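Annotation: each iteration runs the "Create alternative trajectory batch" / "Get advantages with critic gradient accumulation" / "Compute alternative advantage estimates" trio twice, plausibly once per player in the two-agent IPD game, before "Sharing advantage alignment data." and "Receiving advantage packets.". A rough sketch of what that exchange could look like, assuming per-step advantage tensors are all that crosses the trainer boundary; the AdvantagePacket type, the queue transport, and the aligned_weights coupling are illustrative assumptions, not the project's actual protocol.

# advantage_packet_sketch.py -- illustrative exchange of per-step advantages
# between two trainers; the real transport and coupling may differ.
from dataclasses import dataclass
from queue import Queue

import torch

@dataclass
class AdvantagePacket:
    agent_id: str
    advantages: torch.Tensor  # per-step advantage estimates, shape [T]

def share_advantages(outbox: Queue, agent_id: str, adv: torch.Tensor) -> None:
    # Detach and move to CPU so the packet is safe to hand to another process.
    outbox.put(AdvantagePacket(agent_id, adv.detach().cpu()))

def receive_advantages(inbox: Queue) -> AdvantagePacket:
    return inbox.get()

# Advantage Alignment couples the two players' credit assignment: each
# policy-gradient term is reweighted by the opponent's advantages so that
# actions which helped (or hurt) the other agent are reinforced accordingly.
# One simple coupling, stated as an assumption:
def aligned_weights(own_adv: torch.Tensor, opp_adv: torch.Tensor,
                    beta: float = 1.0) -> torch.Tensor:
    return own_adv + beta * torch.cumsum(opp_adv, dim=0)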
[2025-08-20 13:24:27,381][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:24:27,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:24:27,388][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:24:29,844][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:24:29,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:24:29,851][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:24:29,853][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:24:29,854][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:24:30,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:30,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:31,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:32,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:33,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:34,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:34,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:35,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:36,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:37,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:38,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:38,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:39,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:40,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:41,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:42,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:24:42,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:43,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:44,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:45,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:46,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:47,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:48,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:48,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:49,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:50,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:51,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:52,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:52,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:53,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:54,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:55,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:24:56,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:24:57,718][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:24:57,720][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:24:59,045][__main__][INFO] - Iteration 335 took 54s (37.80% Gen, 62.20% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 59m 37s. Estimated total time: 15h 14m 5s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 2s. [2025-08-20 13:24:59,047][__main__][INFO] - Starting iteration 335. 
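Annotation: each "Accumulated the policy gradient loss for 8960 tokens." record corresponds to one microbatch, and 32 such microbatches (~287k tokens) are accumulated per iteration before the single "Apply reinforce step". A minimal sketch of token-level REINFORCE with gradient accumulation under those assumptions; policy.log_probs and the microbatch keys are hypothetical names for illustration.

# reinforce_accumulation_sketch.py -- gradient accumulation over fixed-size
# token microbatches, matching the 32 "Accumulated ..." records per step.
import torch

def reinforce_step(policy, optimizer, microbatches, logger):
    """Each microbatch carries token ids and per-token advantage estimates."""
    optimizer.zero_grad(set_to_none=True)
    total_tokens = sum(mb["advantages"].numel() for mb in microbatches)
    for mb in microbatches:
        # Hypothetical helper: per-token log-probs of the sampled actions.
        logp = policy.log_probs(mb["input_ids"], mb["action_ids"])  # shape [T]
        # REINFORCE: maximize E[logp * advantage], so minimize the negation;
        # dividing by the grand total keeps the step invariant to chunking.
        loss = -(logp * mb["advantages"]).sum() / total_tokens
        loss.backward()  # gradients accumulate across microbatches
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    mb["advantages"].numel())
    optimizer.step()  # the "Apply reinforce step" in the log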
[2025-08-20 13:25:22,241][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:25:22,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:25:22,249][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:25:24,723][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:25:24,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:25:24,731][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:25:24,734][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:25:24,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:25:25,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:25,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:26,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:27,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:28,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:29,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:29,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:30,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:31,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:32,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:32,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:33,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:34,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:35,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:36,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:36,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:25:37,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:38,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:39,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:40,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:41,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:42,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:42,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:43,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:44,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:45,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:46,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:46,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:47,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:48,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:49,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:50,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:25:51,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:25:52,761][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:25:52,763][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:25:54,163][__main__][INFO] - Iteration 336 took 55s (37.58% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 3m 12s. Estimated total time: 15h 18m 35s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 51s, 500 more iterations: 7h 39m 17s. [2025-08-20 13:25:54,164][__main__][INFO] - Starting iteration 336. 
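Annotation: the two "Saved … optimizer state" records indicate the policy and critic optimizers are checkpointed to disk after every step, so a preempted job can resume mid-run without losing optimizer momentum. The equivalent PyTorch calls, with illustrative paths:

# optimizer_checkpoint_sketch.py -- persist optimizer state after each step.
import torch

def save_optimizer_states(policy_opt: torch.optim.Optimizer,
                          critic_opt: torch.optim.Optimizer,
                          out_dir: str) -> None:
    # state_dict() captures per-parameter moments and step counts.
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")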
[2025-08-20 13:26:17,380][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:26:18,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:26:18,600][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:26:21,085][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:26:21,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:26:21,093][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:26:21,096][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:26:21,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:26:21,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:22,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:22,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:23,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:24,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:25,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:26,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:26,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:27,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:28,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:29,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:30,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:30,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:31,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:32,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:33,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:26:34,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:34,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:35,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:36,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:37,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:38,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:39,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:40,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:40,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:41,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:42,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:43,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:44,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:44,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:45,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:46,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:26:48,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:26:49,103][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:26:49,104][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:26:50,489][__main__][INFO] - Iteration 337 took 56s (36.79% Gen, 63.21% Train). Generation: 20s, Training: 35s. Estimated remaining time: 10h 22m 24s. Estimated total time: 15h 38m 44s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 52s, 500 more iterations: 7h 49m 22s. [2025-08-20 13:26:50,490][__main__][INFO] - Starting iteration 337. 
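Annotation: the per-iteration summary ("Iteration N took 55s (…% Gen, …% Train) … Estimated remaining time … Estimated total time …") can be reproduced from a running mean of iteration durations once the total iteration count is known. A sketch of that bookkeeping; the class name and formatting helper are illustrative, and the real code may use a windowed rather than cumulative average, which would explain the small drift in the ETA from one iteration to the next.

# eta_sketch.py -- iteration timing summary in the style of the log above.
def fmt(seconds: float) -> str:
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

class IterationTimer:
    def __init__(self, total_iterations: int):
        self.total = total_iterations
        self.durations: list[float] = []

    def record(self, i: int, gen_s: float, train_s: float) -> str:
        took = gen_s + train_s
        self.durations.append(took)
        mean = sum(self.durations) / len(self.durations)
        return (f"Iteration {i} took {took:.0f}s "
                f"({100 * gen_s / took:.2f}% Gen, {100 * train_s / took:.2f}% Train). "
                f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s. "
                f"Estimated remaining time: {fmt(mean * (self.total - i))}. "
                f"Estimated total time: {fmt(mean * self.total)}.")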
[2025-08-20 13:27:13,879][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:27:13,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:27:13,887][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:27:16,346][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:27:16,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:27:16,354][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:27:16,356][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:27:16,357][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:27:16,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:17,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:18,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:19,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:19,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:20,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:21,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:22,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:23,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:23,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:24,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:25,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:26,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:26,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:27,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:28,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:27:29,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:30,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:30,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:31,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:32,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:33,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:34,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:35,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:36,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:37,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:37,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:38,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:39,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:40,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:40,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:41,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:27:43,348][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:27:44,546][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:27:44,548][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:27:45,987][__main__][INFO] - Iteration 338 took 55s (37.75% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 10h 7m 41s. Estimated total time: 15h 24m 56s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 29s, 500 more iterations: 7h 42m 28s. [2025-08-20 13:27:45,989][__main__][INFO] - Starting iteration 338. 
[2025-08-20 13:28:09,174][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:28:09,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:28:09,182][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:28:11,645][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:28:11,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:28:11,653][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:28:11,656][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:28:11,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:28:11,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:12,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:13,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:14,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:15,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:15,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:16,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:17,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:18,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:19,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:19,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:20,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:21,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:22,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:23,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:23,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:28:24,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:25,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:26,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:27,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:27,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:28,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:29,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:30,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:31,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:32,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:33,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:33,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:34,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:35,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:36,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:37,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:28:38,704][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:28:39,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:28:39,654][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:28:40,952][__main__][INFO] - Iteration 339 took 54s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 57m 52s. Estimated total time: 15h 16m 2s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 1s. [2025-08-20 13:28:40,953][__main__][INFO] - Starting iteration 339. 
[2025-08-20 13:29:03,982][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:29:03,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:29:03,990][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:29:06,450][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:29:06,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:29:06,458][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:29:06,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:29:06,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:29:06,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:07,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:08,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:09,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:09,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:10,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:11,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:12,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:13,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:13,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:14,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:15,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:16,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:17,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:17,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:18,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:29:19,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:20,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:21,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:21,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:22,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:23,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:24,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:24,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:26,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:27,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:27,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:28,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:29,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:30,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:31,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:31,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:29:33,452][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:29:34,521][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:29:34,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:29:35,856][__main__][INFO] - Iteration 340 took 54s (37.50% Gen, 62.49% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 55m 58s. Estimated total time: 15h 15m 3s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 31s. [2025-08-20 13:29:35,858][__main__][INFO] - Starting iteration 340. 
[2025-08-20 13:29:58,852][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:29:58,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:29:58,859][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:01,317][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:01,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:01,325][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:01,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:30:01,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:30:01,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:02,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:03,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:04,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:04,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:05,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:06,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:07,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:07,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:08,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:09,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:10,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:11,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:11,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:12,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:13,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:30:14,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:15,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:15,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:16,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:17,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:18,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:19,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:20,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:21,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:21,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:22,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:23,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:24,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:25,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:25,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:26,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:28,254][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:30:29,231][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:30:29,233][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:30:30,647][__main__][INFO] - Iteration 341 took 54s (37.48% Gen, 62.52% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 53m 9s. Estimated total time: 15h 13m 8s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 18s, 500 more iterations: 7h 36m 34s. [2025-08-20 13:30:30,648][__main__][INFO] - Starting iteration 341. 
[2025-08-20 13:30:53,800][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:53,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:53,809][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:56,267][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:56,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:56,274][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:30:56,277][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:30:56,277][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:30:56,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:57,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:58,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:58,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:30:59,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:00,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:01,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:02,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:02,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:03,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:04,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:05,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:06,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:06,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:07,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:08,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:31:09,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:10,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:10,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:11,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:12,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:13,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:14,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:15,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:16,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:16,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:17,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:18,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:19,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:20,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:20,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:21,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:31:23,311][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:31:24,279][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:31:24,281][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:31:25,581][__main__][INFO] - Iteration 342 took 54s (37.70% Gen, 62.29% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 54m 38s. Estimated total time: 15h 15m 32s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 46s. [2025-08-20 13:31:25,583][__main__][INFO] - Starting iteration 342. 
[2025-08-20 13:31:48,979][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:31:48,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:31:48,987][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:31:51,459][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:31:51,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:31:51,467][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:31:51,470][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:31:51,470][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:31:51,770 to 13:32:16,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 consecutive identical entries)
[2025-08-20 13:32:18,483][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:32:19,435][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:32:19,436][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:32:20,841][__main__][INFO] - Iteration 343 took 55s (37.87% Gen, 62.13% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 59m 7s. Estimated total time: 15h 20m 57s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 5s, 500 more iterations: 7h 40m 28s.
[2025-08-20 13:32:20,842][__main__][INFO] - Starting iteration 343.
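The 32 "Accumulated the policy gradient loss for 8960 tokens" entries per iteration, followed by a single "Apply reinforce step", are the signature of gradient accumulation: one backward pass per micro-batch, one optimizer step per iteration. A sketch of that loop, assuming a HuggingFace-style causal LM and batch fields (input_ids, actions, advantages, mask) that are illustrative rather than mllm's actual schema:

    import logging
    import torch

    log = logging.getLogger("mllm.training.trainer_common")

    def reinforce_step(policy, optimizer, micro_batches):
        # One REINFORCE update accumulated over all micro-batches (32 in this run).
        optimizer.zero_grad(set_to_none=True)
        n = len(micro_batches)
        for batch in micro_batches:
            logits = policy(batch["input_ids"]).logits
            logp = torch.log_softmax(logits, dim=-1)
            # Log-prob of each generated token ("action") under the current policy.
            token_logp = logp.gather(-1, batch["actions"].unsqueeze(-1)).squeeze(-1)
            # Policy-gradient surrogate: negative advantage-weighted log-likelihood,
            # averaged over the response tokens selected by the mask.
            masked = batch["advantages"] * token_logp * batch["mask"]
            loss = -masked.sum() / batch["mask"].sum()
            (loss / n).backward()
            log.info("Accumulated the policy gradient loss for %d tokens.",
                     int(batch["mask"].sum()))
        optimizer.step()  # surfaces in the log as "Apply reinforce step"

The near-zero ΔVRAM on the reinforce step is consistent with this structure: activations are freed after each micro-batch backward, so the step itself allocates almost nothing new.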
[2025-08-20 13:32:44,464][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:32:44,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:32:44,472][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:32:46,913][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:32:46,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:32:46,920][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:32:46,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:32:46,923][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:32:47,222 to 13:33:12,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 consecutive identical entries)
[2025-08-20 13:33:13,817][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:33:14,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:33:14,783][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:33:16,160][__main__][INFO] - Iteration 344 took 55s (38.28% Gen, 61.72% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 59m 11s. Estimated total time: 15h 21m 56s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 11s, 500 more iterations: 7h 40m 58s.
[2025-08-20 13:33:16,161][__main__][INFO] - Starting iteration 344.
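Both optimizer states are rewritten to scratch after every iteration, so a preempted or crashed job can resume with its Adam moments intact rather than restarting them from zero. The saves are presumably the standard torch.save of a state_dict; the paths are taken from the log, while the helper name is hypothetical:

    import torch

    OUT = ("/network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/"
           "ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer")

    def save_optimizer_states(policy_opt, critic_opt):
        # Overwrite in place each iteration; only the latest state is needed.
        torch.save(policy_opt.state_dict(), f"{OUT}/policy_optimizer_state.pt")
        torch.save(critic_opt.state_dict(), f"{OUT}/critic_optimizer_state.pt")

    # On resume (sketch):
    # policy_opt.load_state_dict(torch.load(f"{OUT}/policy_optimizer_state.pt"))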
[2025-08-20 13:33:39,233][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:33:39,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:33:39,241][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:33:41,697][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:33:41,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:33:41,705][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:33:41,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:33:41,708][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:33:42,012 to 13:34:07,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 consecutive identical entries)
[2025-08-20 13:34:08,719][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:34:09,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:34:09,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:34:11,025][__main__][INFO] - Iteration 345 took 54s (37.61% Gen, 62.39% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 50m 43s. Estimated total time: 15h 14m 23s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 26s, 500 more iterations: 7h 37m 11s.
[2025-08-20 13:34:11,026][__main__][INFO] - Starting iteration 345.
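The end-of-iteration summary extrapolates linearly from the measured iteration time: at ~54.8 s per iteration, 10 more iterations comes out to 9m 8s and 500 more to about 7h 37m, matching the figures above. A sketch of the arithmetic (function names illustrative):

    def fmt_hms(seconds):
        # "9h 50m 43s" style, as in the log.
        s = int(seconds)
        h, rem = divmod(s, 3600)
        m, sec = divmod(rem, 60)
        return f"{h}h {m}m {sec}s"

    def projections(iter_seconds, iters_done, iters_total):
        # Linear extrapolation from the latest iteration duration.
        out = {
            "remaining": fmt_hms((iters_total - iters_done) * iter_seconds),
            "total": fmt_hms(iters_total * iter_seconds),
        }
        for k in (10, 100, 500):
            out[f"{k} more iterations"] = fmt_hms(k * iter_seconds)
        return out

Because the projection tracks the latest iteration, the remaining-time estimate jitters by tens of minutes whenever a single iteration runs a few seconds long, as the summaries below show.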
[2025-08-20 13:34:34,149][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:34:34,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:34:34,157][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:34:36,627][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:34:38,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:34:38,393][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:34:38,396][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:34:38,397][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:34:38,695 to 13:35:05,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 consecutive identical entries)
[2025-08-20 13:35:07,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:35:08,181][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:35:08,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:35:09,553][__main__][INFO] - Iteration 346 took 58s (35.32% Gen, 64.68% Train). Generation: 20s, Training: 37s. Estimated remaining time: 10h 50m 48s. Estimated total time: 16h 15m 26s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 32s, 500 more iterations: 8h 7m 43s.
[2025-08-20 13:35:09,555][__main__][INFO] - Starting iteration 346.
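The "Sharing advantage alignment data" / "Receiving advantage packets" pair marks the point where each agent's trainer swaps per-step advantage estimates with the other, so the policy update can be shaped by both players' outcomes. The packet layout and the alignment term below are assumptions for illustration only; one simple alignment signal is the product of the two players' advantages, which is positive when an action moved both returns in the same direction:

    from dataclasses import dataclass

    import torch

    @dataclass
    class AdvantagePacket:  # hypothetical wire format
        trajectory_id: int
        advantages: torch.Tensor  # per-step advantage estimates

    def alignment_term(own_adv, opp_adv):
        # Rewards jointly beneficial actions; the exact combination used by
        # trainer_ad_align is not visible in the log.
        return own_adv * opp_adv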
[2025-08-20 13:35:32,850][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:35:32,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:35:32,858][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:35:35,324][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:35:35,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:35:35,331][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:35:35,334][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:35:35,335][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:35:35,635 to 13:36:00,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 consecutive identical entries)
[2025-08-20 13:36:02,338][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:36:03,424][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:36:03,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:36:04,790][__main__][INFO] - Iteration 347 took 55s (37.73% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 55m 1s. Estimated total time: 15h 20m 34s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 3s, 500 more iterations: 7h 40m 17s.
[2025-08-20 13:36:04,792][__main__][INFO] - Starting iteration 347.
[2025-08-20 13:36:27,782][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:36:27,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:36:27,796][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:36:30,275][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:36:30,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:36:30,283][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:36:30,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:36:30,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:36:30,607 to 13:36:55,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 consecutive identical entries)
[2025-08-20 13:36:57,324][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:37:00,119][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:37:00,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:37:01,596][__main__][INFO] - Iteration 348 took 56s (36.13% Gen, 63.87% Train). Generation: 20s, Training: 36s. Estimated remaining time: 10h 20m 12s. Estimated total time: 15h 46m 42s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 40s, 500 more iterations: 7h 53m 21s.
[2025-08-20 13:37:01,598][__main__][INFO] - Starting iteration 348.
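The "Create alternative trajectory batch" / "Compute alternative advantage estimates" tasks evaluate counterfactual actions alongside the sampled ones. What counts as an alternative is not visible in the log; in an iterated prisoner's dilemma the natural counterfactual is the opposite move. A heavily simplified sketch, with critic.q and critic.value as illustrative interfaces rather than mllm's API:

    import torch

    @torch.no_grad()
    def alternative_advantages(critic, states, alt_actions):
        # Advantage of the counterfactual action relative to the state value,
        # computed without gradients since it only feeds the alignment signal.
        return critic.q(states, alt_actions) - critic.value(states)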
[2025-08-20 13:37:25,221][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:37:25,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:37:25,228][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:37:27,661][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:37:27,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:37:27,669][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:37:27,671][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:37:27,672][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:37:27,975 to 13:37:53,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 consecutive identical entries)
[2025-08-20 13:37:54,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:37:55,551][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:37:55,552][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:37:57,821][__main__][INFO] - Iteration 349 took 56s (37.66% Gen, 62.34% Train). Generation: 21s, Training: 35s. Estimated remaining time: 10h 9m 18s. Estimated total time: 15h 36m 45s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 40s, 500 more iterations: 7h 48m 22s.
[2025-08-20 13:37:57,823][__main__][INFO] - Starting iteration 349.
[2025-08-20 13:38:20,919][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:38:20,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:38:20,927][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:38:23,371][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:38:23,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:38:23,379][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:38:23,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:38:23,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:38:23,680 to 13:38:48,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 consecutive identical entries)
[2025-08-20 13:38:50,284][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:38:51,228][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:38:51,230][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:38:52,577][__main__][INFO] - Iteration 350 took 54s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 44m 11s. Estimated total time: 15h 12m 32s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 16s.
[2025-08-20 13:38:52,579][__main__][INFO] - Starting iteration 350.
[2025-08-20 13:39:17,351][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:39:17,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:39:17,359][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:39:19,813][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:39:19,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:39:19,821][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:39:19,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:39:19,824][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:39:20,124 to 13:39:45,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 consecutive identical entries)
[2025-08-20 13:39:46,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:39:47,747][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:39:47,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:39:51,972][__main__][INFO] - Iteration 351 took 59s (37.56% Gen, 57.71% Train). Generation: 22s, Training: 34s. Estimated remaining time: 11h 0m 32s. Estimated total time: 16h 29m 53s. Time estimates for 10 more iterations: 9m 53s, 100 more iterations: 1h 38m 59s, 500 more iterations: 8h 14m 56s.
[2025-08-20 13:39:51,974][__main__][INFO] - Starting iteration 351.
[2025-08-20 13:40:15,457][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:40:15,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:40:15,465][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:40:17,904][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:40:17,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:40:17,912][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:40:17,914][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:40:17,914][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:40:18,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens." entries, roughly one every 0.8 s, omitted ...]
[2025-08-20 13:40:43,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:40:44,842][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:40:45,780][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:40:45,781][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:40:47,487][__main__][INFO] - Iteration 352 took 55s (37.92% Gen, 62.08% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 54m 55s. Estimated total time: 15h 25m 12s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 36s.
[2025-08-20 13:40:47,488][__main__][INFO] - Starting iteration 352.
[2025-08-20 13:41:10,800][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:41:10,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:41:10,808][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:41:13,244][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:41:13,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:41:13,252][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:41:13,254][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:41:13,255][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:41:13,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens." entries, roughly one every 0.8 s, omitted ...]
[2025-08-20 13:41:38,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:41:40,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:41:41,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:41:41,145][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:41:42,504][__main__][INFO] - Iteration 353 took 55s (37.94% Gen, 62.06% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 45m 44s. Estimated total time: 15h 16m 55s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 27s.
[2025-08-20 13:41:42,506][__main__][INFO] - Starting iteration 353.
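The recurring "For task: ... ΔVRAM Allocated ... ΔVRAM Reserved ... ΔTime ... Percentage of VRAM taken" entries read like output from a small instrumentation wrapper around torch.cuda's allocator counters; the zero deltas indicate those steps run entirely inside already-reserved buffers. A plausible reconstruction of such a wrapper (an assumption, not the actual mllm code; requires a CUDA device):

    import time
    from contextlib import contextmanager
    import torch

    @contextmanager
    def track(task: str):
        # Snapshot allocator state and wall time before the task body runs.
        alloc0 = torch.cuda.memory_allocated()
        reserved0 = torch.cuda.memory_reserved()
        t0 = time.time()
        yield
        d_alloc = (torch.cuda.memory_allocated() - alloc0) / 2**30
        d_reserved = (torch.cuda.memory_reserved() - reserved0) / 2**30
        d_time = time.strftime("%H:%M:%S", time.gmtime(time.time() - t0))
        total = torch.cuda.get_device_properties(0).total_memory
        pct = 100 * torch.cuda.memory_reserved() / total
        print(f"For task: {task} , ΔVRAM Allocated: {d_alloc} GB, "
              f"ΔVRAM Reserved: {d_reserved} GB, ΔTime: {d_time}, "
              f"Percentage of VRAM taken: {pct}%,")

Usage would be along the lines of `with track("Create alternative trajectory batch"): ...` wrapped around each step.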
[2025-08-20 13:42:06,006][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:42:06,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:42:06,014][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:42:08,453][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:42:08,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:42:08,461][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:42:08,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:42:08,464][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:42:08,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens." entries, roughly one every 0.8 s, omitted ...]
[2025-08-20 13:42:33,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:42:35,486][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:42:36,440][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:42:36,441][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:42:37,794][__main__][INFO] - Iteration 354 took 55s (38.10% Gen, 61.90% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 49m 20s. Estimated total time: 15h 21m 27s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 8s, 500 more iterations: 7h 40m 43s.
[2025-08-20 13:42:37,795][__main__][INFO] - Starting iteration 354.
[2025-08-20 13:43:01,266][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:01,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:01,274][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:03,744][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:03,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:03,752][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:03,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:43:03,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:43:04,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens." entries, roughly one every 0.8 s, omitted ...]
[2025-08-20 13:43:29,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:43:30,708][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:43:31,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:43:31,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:43:33,014][__main__][INFO] - Iteration 355 took 55s (38.04% Gen, 61.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 47m 16s. Estimated total time: 15h 20m 18s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 1s, 500 more iterations: 7h 40m 9s.
[2025-08-20 13:43:33,015][__main__][INFO] - Starting iteration 355.
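The paired "Sharing advantage alignment data." / "Receiving advantage packets." entries suggest that two trainers (one per negotiating agent) exchange advantage estimates before forming the alignment loss. A purely illustrative exchange mechanism, assuming an in-process setup; the log does not show how the real transport works:

    import queue

    class AdvantageChannel:
        """Hypothetical two-way mailbox for advantage packets between two trainers."""
        def __init__(self):
            self.boxes = {0: queue.Queue(), 1: queue.Queue()}

        def share(self, sender: int, advantages) -> None:
            # Corresponds to "Sharing advantage alignment data."
            self.boxes[1 - sender].put(advantages)

        def receive(self, receiver: int):
            # Corresponds to "Receiving advantage packets."
            return self.boxes[receiver].get()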
[2025-08-20 13:43:56,061][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:56,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:56,068][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:58,533][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:58,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:58,541][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:43:58,543][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:43:58,544][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:43:58,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens." entries, roughly one every 0.8 s, omitted ...]
[2025-08-20 13:44:23,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:44:25,499][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:44:26,455][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:44:26,458][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:44:28,009][__main__][INFO] - Iteration 356 took 54s (37.45% Gen, 62.55% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 42m 35s. Estimated total time: 15h 16m 32s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 39s, 500 more iterations: 7h 38m 16s.
[2025-08-20 13:44:28,010][__main__][INFO] - Starting iteration 356.
[2025-08-20 13:44:51,264][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:44:51,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:44:51,272][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:44:53,720][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:44:53,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:44:53,728][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:44:53,731][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:44:53,731][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:44:54,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens." entries, roughly one every 0.8 s, omitted ...]
[2025-08-20 13:45:19,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:45:20,709][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:45:21,658][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:45:21,660][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:45:22,970][__main__][INFO] - Iteration 357 took 54s (37.85% Gen, 62.15% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 41m 7s. Estimated total time: 15h 15m 59s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 59s.
[2025-08-20 13:45:22,972][__main__][INFO] - Starting iteration 357.
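The "Get advantages with critic gradient accumulation" task name suggests the critic's value-regression gradient is accumulated in the same pass that produces the advantage estimates, so the critic optimizer (whose state is saved each iteration) can step alongside the policy. A hedged sketch under that reading; the plain Monte-Carlo return below is an assumption, not necessarily the estimator the trainer actually uses:

    import torch
    import torch.nn as nn

    critic = nn.Linear(16, 1)  # hypothetical value head

    def advantages_with_critic_grads(states, rewards, gamma=0.99):
        values = critic(states).squeeze(-1)
        # Discounted returns, computed backwards over the trajectory.
        returns = torch.zeros_like(rewards)
        running = 0.0
        for t in reversed(range(rewards.shape[0])):
            running = rewards[t] + gamma * running
            returns[t] = running
        # Accumulate the critic's regression gradient now; its optimizer
        # steps later, alongside the policy update.
        (0.5 * (values - returns).pow(2).mean()).backward()
        return (returns - values).detach()  # advantage = return - baseline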
[2025-08-20 13:45:47,031][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:45:47,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:45:47,038][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:45:49,507][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:45:49,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:45:49,515][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:45:49,517][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:45:49,518][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:45:49,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens." entries, roughly one every 0.8 s, omitted ...]
[2025-08-20 13:46:14,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:46:16,457][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:46:17,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:46:17,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:46:18,792][__main__][INFO] - Iteration 358 took 55s (38.70% Gen, 61.30% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 54m 32s. Estimated total time: 15h 30m 19s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 9s.
[2025-08-20 13:46:18,794][__main__][INFO] - Starting iteration 358.
[2025-08-20 13:46:42,266][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:46:42,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:46:42,273][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:46:44,726][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:46:44,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:46:44,733][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:46:44,736][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:46:44,736][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:46:45,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens." entries, roughly one every 0.8 s, omitted ...]
[2025-08-20 13:47:10,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:47:11,727][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:47:12,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:47:12,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:47:14,138][__main__][INFO] - Iteration 359 took 55s (38.00% Gen, 62.00% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 45m 40s. Estimated total time: 15h 22m 23s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 11s.
[2025-08-20 13:47:14,139][__main__][INFO] - Starting iteration 359.
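The time estimates in each iteration summary are consistent with extrapolating an average iteration duration over a fixed horizon; at roughly 55 s per iteration, the ~9.75 h remaining after iteration 359 implies a run of about 1000 iterations. A small sketch of that arithmetic (the exact averaging window and the 1000-iteration total are assumptions inferred from the logged numbers):

    def fmt(seconds: float) -> str:
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s"

    def report_eta(durations: list, done: int, total: int) -> None:
        avg = sum(durations) / len(durations)  # assumed: mean over recent iterations
        print(f"Estimated remaining time: {fmt(avg * (total - done))}. "
              f"Estimated total time: {fmt(avg * total)}.")
        extras = ", ".join(f"{n} more iterations: {fmt(avg * n)}" for n in (10, 100, 500))
        print(f"Time estimates for {extras}.")

    report_eta([55, 54, 55, 55], done=359, total=1000)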
[2025-08-20 13:47:37,713][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:47:37,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:47:37,721][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:47:40,210][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:47:40,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:47:40,218][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 13:47:40,220][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:47:40,221][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:47:40,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens." entries, roughly one every 0.8 s, omitted ...]
[2025-08-20 13:48:05,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 13:48:07,214][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 13:48:08,174][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:48:08,175][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:48:09,420][__main__][INFO] - Iteration 360 took 55s (38.17% Gen, 61.83% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 43m 41s. Estimated total time: 15h 21m 19s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 7s, 500 more iterations: 7h 40m 39s.
[2025-08-20 13:48:09,421][__main__][INFO] - Starting iteration 360.
[2025-08-20 13:48:32,655][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:48:32,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:48:32,662][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:48:35,120][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:48:35,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:48:35,128][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:48:35,130][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:48:35,131][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:48:35,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:36,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:37,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:37,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:38,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:39,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:40,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:40,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:41,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:42,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:43,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:44,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:44,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:45,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:46,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:47,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:48:48,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:48,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:49,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:50,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:51,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:52,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:53,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:54,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:54,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:55,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:56,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:57,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:58,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:58,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:48:59,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:00,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:02,102][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:49:03,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:49:03,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:49:04,620][__main__][INFO] - Iteration 361 took 55s (37.66% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 41m 24s. Estimated total time: 15h 19m 58s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 59s, 500 more iterations: 7h 39m 59s. [2025-08-20 13:49:04,622][__main__][INFO] - Starting iteration 361. 
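Between "Receiving advantage packets." and "Apply reinforce step" each iteration in this stretch logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." records, i.e. 32 micro-batches × 8,960 tokens = 286,720 tokens folded into each update. A plausible shape for that loop, with gradient accumulation via repeated backward() calls, is sketched below; the tensor names, the per-token advantage weighting, and the token_log_probs helper are assumptions, not the repository's code.

    import torch

    def accumulate_policy_gradient(policy, micro_batches, logger):
        # Normalize by the total token count so the summed micro-batch
        # losses equal one full-batch mean loss.
        total_tokens = sum(mb["mask"].sum().item() for mb in micro_batches)
        for mb in micro_batches:
            logp = policy.token_log_probs(mb["input_ids"])  # hypothetical helper
            loss = -(mb["advantages"] * logp * mb["mask"]).sum() / total_tokens
            loss.backward()  # gradients accumulate in .grad across micro-batches
            logger.info(
                "Accumulated the policy gradient loss for %d tokens.",
                int(mb["mask"].sum()),
            )

Each backward() call frees its own activations, which is consistent with 286,720 tokens being processed per update while "Percentage of VRAM taken" stays flat at ~69.9% across iterations.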
[2025-08-20 13:49:27,835][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:49:27,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:49:27,843][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:49:30,297][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:49:30,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:49:30,305][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:49:30,307][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:49:30,307][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:49:30,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:31,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:32,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:32,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:33,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:34,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:35,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:36,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:36,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:37,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:38,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:39,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:40,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:40,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:41,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:42,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:49:43,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:44,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:44,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:45,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:46,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:47,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:48,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:48,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:49,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:50,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:51,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:52,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:53,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:54,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:54,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:55,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:49:57,281][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:49:58,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:49:58,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:49:59,568][__main__][INFO] - Iteration 362 took 54s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 36m 17s. Estimated total time: 15h 15m 46s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 34s, 500 more iterations: 7h 37m 53s. [2025-08-20 13:49:59,570][__main__][INFO] - Starting iteration 362. 
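The per-iteration summary lines are internally consistent: for iteration 362 above, "Time estimates for 10 more iterations: 9m 9s" implies an average of ~54.9 s per iteration, matching the 54 s just reported. A sketch of how such a summary string could be assembled follows; both the formatting helper and the linear extrapolation are assumptions about how the numbers are produced.

    def fmt(seconds):
        # Render 549 -> "9m 9s", 34577 -> "9h 36m 17s", 54 -> "54s".
        s = int(round(seconds))
        h, rem = divmod(s, 3600)
        m, s = divmod(rem, 60)
        if h:
            return f"{h}h {m}m {s}s"
        return f"{m}m {s}s" if m else f"{s}s"

    def iteration_summary(i, remaining_iters, gen_s, train_s, avg_iter_s):
        took = gen_s + train_s
        horizons = ", ".join(
            f"{n} more iterations: {fmt(n * avg_iter_s)}" for n in (10, 100, 500)
        )
        return (
            f"Iteration {i} took {fmt(took)} "
            f"({100 * gen_s / took:.2f}% Gen, {100 * train_s / took:.2f}% Train). "
            f"Generation: {fmt(gen_s)}, Training: {fmt(train_s)}. "
            f"Estimated remaining time: {fmt(remaining_iters * avg_iter_s)}. "
            f"Time estimates for {horizons}."
        )

The small drift in "Estimated total time" from one iteration to the next suggests the average is recomputed over the observed history rather than fixed up front.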
[2025-08-20 13:50:22,583][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:50:22,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:50:22,590][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:50:25,059][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:50:25,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:50:25,067][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:50:25,069][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:50:25,070][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:50:25,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:26,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:26,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:27,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:28,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:29,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:30,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:30,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:31,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:32,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:33,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:34,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:34,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:35,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:36,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:37,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:50:38,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:38,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:39,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:40,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:41,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:42,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:42,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:43,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:44,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:45,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:46,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:47,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:48,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:48,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:49,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:50,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:50:52,129][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:50:53,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:50:53,070][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:50:54,656][__main__][INFO] - Iteration 363 took 55s (37.33% Gen, 62.67% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 37m 42s. Estimated total time: 15h 18m 6s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 3s. [2025-08-20 13:50:54,658][__main__][INFO] - Starting iteration 363. 
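Each iteration ends by persisting both optimizer states next to the trainer outputs, so a preempted run can resume without losing optimizer statistics (e.g. Adam moment estimates, if Adam is used). A minimal version of that save step, with a hypothetical function name, is just torch.save of each state_dict:

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
        policy_path = f"{out_dir}/policy_optimizer_state.pt"
        torch.save(policy_opt.state_dict(), policy_path)
        logger.info("Saved main optimizer state to %s", policy_path)

        critic_path = f"{out_dir}/critic_optimizer_state.pt"
        torch.save(critic_opt.state_dict(), critic_path)
        logger.info("Saved critic optimizer state to %s", critic_path)

On restart, the matching load is `opt.load_state_dict(torch.load(path))` before training continues.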
[2025-08-20 13:51:17,715][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:51:17,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:51:17,722][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:51:20,173][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:51:20,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:51:20,181][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:51:20,183][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:51:20,183][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:51:20,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:21,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:22,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:22,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:23,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:24,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:25,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:26,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:26,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:27,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:28,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:29,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:30,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:30,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:31,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:32,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:51:33,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:33,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:34,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:35,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:36,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:37,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:38,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:39,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:39,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:40,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:41,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:42,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:43,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:43,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:44,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:45,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:51:47,210][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:51:48,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:51:48,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:51:49,511][__main__][INFO] - Iteration 364 took 54s (37.56% Gen, 62.44% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 32m 53s. Estimated total time: 15h 14m 12s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 25s, 500 more iterations: 7h 37m 6s. [2025-08-20 13:51:49,512][__main__][INFO] - Starting iteration 364. 
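The "Apply reinforce step" record closes the accumulation window (its ΔTime of ~26-27 s covers the 32 backward passes above) with a single optimizer update, and the near-zero ΔVRAM Allocated (~2.3e-05 GB) suggests the optimizer state is already allocated and merely reused. A generic form of the step itself, with gradient clipping included as an assumed but common safeguard rather than something visible in the log:

    import torch

    def apply_reinforce_step(policy, optimizer, max_grad_norm=1.0):
        # Clip the accumulated gradient, take one update, then reset the
        # .grad buffers so the next iteration starts from zero.
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)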
[2025-08-20 13:52:12,481][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:52:12,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:52:12,489][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:52:14,969][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:52:14,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:52:14,977][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:52:14,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:52:14,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:52:15,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:16,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:16,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:17,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:18,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:19,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:20,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:20,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:21,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:22,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:23,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:24,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:24,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:25,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:26,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:27,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:52:27,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:29,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:30,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:30,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:31,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:32,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:33,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:33,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:34,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:35,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:36,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:37,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:37,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:38,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:39,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:40,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:52:41,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:52:42,914][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:52:42,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:52:46,186][__main__][INFO] - Iteration 365 took 56s (36.15% Gen, 63.85% Train). Generation: 20s, Training: 36s. Estimated remaining time: 10h 2m 18s. Estimated total time: 15h 44m 33s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 27s, 500 more iterations: 7h 52m 16s. [2025-08-20 13:52:46,187][__main__][INFO] - Starting iteration 365. 
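The "Sharing advantage alignment data." / "Receiving advantage packets." pair indicates that co-trained agents exchange advantage estimates each iteration before the policy loss is formed. The sketch below shows one way such an exchange could look; the packet layout, the queue transport, and how the received advantages are ultimately combined with the agent's own are all assumptions.

    from dataclasses import dataclass

    import torch

    @dataclass
    class AdvantagePacket:
        trajectory_ids: list      # which rollouts the tensors belong to
        advantages: torch.Tensor  # per-token advantage estimates, shape (B, T)

    def exchange_advantages(own_packet, send_queue, recv_queue, logger):
        logger.info("Sharing advantage alignment data.")
        send_queue.put(own_packet)
        logger.info("Receiving advantage packets.")
        return recv_queue.get()

The received opponent advantages would then enter the policy-gradient weighting alongside the agent's own, which is the defining idea of advantage-alignment objectives.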
[2025-08-20 13:53:09,509][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:53:09,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:53:09,516][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:53:11,976][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:53:11,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:53:11,983][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:53:11,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:53:11,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:53:12,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:13,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:13,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:14,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:15,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:16,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:17,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:17,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:18,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:19,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:20,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:21,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:21,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:22,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:23,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:24,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:53:25,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:26,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:27,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:27,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:28,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:29,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:30,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:31,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:31,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:32,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:33,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:34,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:34,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:35,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:36,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:37,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:53:39,031][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:53:39,980][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:53:39,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:53:41,310][__main__][INFO] - Iteration 366 took 55s (37.84% Gen, 62.16% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 35m 32s. Estimated total time: 15h 18m 42s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 21s. [2025-08-20 13:53:41,312][__main__][INFO] - Starting iteration 366. 
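"Get advantages with critic gradient accumulation" suggests the critic's value loss is accumulated in the same pass that produces advantage estimates for the policy. A generic sketch follows; the actual estimator (e.g. GAE versus plain returns-minus-baseline) is not visible in the log.

    import torch
    import torch.nn.functional as F

    def advantages_with_critic_accumulation(critic, batch):
        values = critic(batch["observations"]).squeeze(-1)  # (B, T) value estimates
        advantages = batch["returns"] - values.detach()     # baseline-subtracted
        value_loss = F.mse_loss(values, batch["returns"])
        value_loss.backward()                               # critic grads accumulate
        return advantages

Detaching the values keeps the advantage signal from backpropagating into the critic through the policy loss; only the MSE term trains the critic.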
[2025-08-20 13:54:04,385][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:54:04,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:54:04,393][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:54:06,843][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:54:06,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:54:06,851][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:54:06,853][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:54:06,854][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:54:07,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:07,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:08,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:09,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:10,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:11,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:11,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:13,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:13,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:15,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:18,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:19,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:19,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:20,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:21,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:22,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:54:23,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:23,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:24,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:25,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:26,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:27,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:27,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:28,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:29,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:30,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:31,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:34,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:34,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:35,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:36,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:37,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:54:38,991][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:32, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:54:39,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:54:39,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:54:41,369][__main__][INFO] - Iteration 367 took 1m 0s (34.35% Gen, 65.65% Train). Generation: 20s, Training: 39s. Estimated remaining time: 10h 56m 46s. Estimated total time: 16h 40m 56s. Time estimates for 10 more iterations: 10m 0s, 100 more iterations: 1h 40m 5s, 500 more iterations: 8h 20m 28s. [2025-08-20 13:54:41,370][__main__][INFO] - Starting iteration 367. 
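"Create alternative trajectory batch" and "Compute alternative advantage estimates" run twice per iteration before the exchange, which is consistent with scoring counterfactual variants of the sampled trajectories. Purely as an illustrative guess, since the log does not show how the alternatives are built:

    def alternative_trajectory_batch(batch, alternative_actions):
        # Clone the rollout tensors and splice in alternative actions so the
        # critic can produce advantage estimates for the counterfactual choice.
        alt = {key: value.clone() for key, value in batch.items()}
        alt["actions"] = alternative_actions
        return alt

Both task records report ΔTime 00:00:00 and zero ΔVRAM, so whatever these steps do, they are cheap relative to the backward passes.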
[2025-08-20 13:55:04,371][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:55:04,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:55:04,378][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:55:06,822][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:55:06,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:55:06,831][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:55:06,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:55:06,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:55:07,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:07,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:08,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:09,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:10,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:11,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:11,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:12,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:13,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:14,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:15,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:15,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:16,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:17,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:18,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:19,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:55:19,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:21,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:21,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:22,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:23,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:24,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:25,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:25,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:26,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:27,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:28,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:29,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:29,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:30,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:31,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:32,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:55:33,911][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:55:34,905][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:55:34,909][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:55:36,361][__main__][INFO] - Iteration 368 took 54s (37.38% Gen, 62.62% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 31m 25s. Estimated total time: 15h 16m 30s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 39s, 500 more iterations: 7h 38m 15s. [2025-08-20 13:55:36,363][__main__][INFO] - Starting iteration 368. 
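Every record here follows the same "[timestamp][module][LEVEL] - message" layout. For reference, Python's standard logging module reproduces it exactly with the formatter below; the default asctime rendering already yields the comma-separated milliseconds seen above. Whether the project configures this by hand or through a framework's logging config is not visible in the log itself.

    import logging

    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")
    )
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)

    logging.getLogger("mllm.training.trainer_common").info(
        "Accumulated the policy gradient loss for %d tokens.", 8960
    )
    # -> [2025-08-20 13:55:36,361][mllm.training.trainer_common][INFO] - Accumulated ...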
[2025-08-20 13:55:59,347][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:55:59,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:55:59,355][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:56:01,848][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:56:01,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:56:01,855][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 13:56:01,858][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 13:56:01,858][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 13:56:02,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:02,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:03,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:04,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:05,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:06,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:06,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:07,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:08,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:09,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:10,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:10,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:11,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:12,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:13,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:14,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 13:56:14,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:15,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:16,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:17,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:18,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:19,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:20,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:20,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:21,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:22,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:23,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:24,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:24,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:25,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:26,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:27,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 13:56:28,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 13:56:29,924][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 13:56:29,926][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 13:56:33,746][__main__][INFO] - Iteration 369 took 57s (35.71% Gen, 64.29% Train). Generation: 20s, Training: 36s. Estimated remaining time: 10h 10m 19s. Estimated total time: 15h 56m 22s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 38s, 500 more iterations: 7h 58m 11s. [2025-08-20 13:56:33,747][__main__][INFO] - Starting iteration 369. 
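The spacing of the accumulation records implies a steady backward-pass throughput: consecutive "8960 tokens" records land roughly 0.79-0.80 s apart, and 32 of them precede each update. Back-of-envelope, assuming those observed numbers:

    tokens_per_microbatch = 8960
    gap_seconds = 0.795                          # typical spacing between records above
    print(tokens_per_microbatch / gap_seconds)   # ~11,270 tokens/s through backward
    print(32 * tokens_per_microbatch)            # 286,720 tokens folded into one step

Iterations that deviate (e.g. iteration 369 at 57 s) show correspondingly larger gaps between a few of the records rather than a different per-record cost.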
[2025-08-20 13:56:56,674][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:56:56,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:56:56,681][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:56:59,144][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:56:59,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:56:59,152][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:56:59,154][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:56:59,156][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:56:59,454 … 13:57:24,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical records in this interval)
[2025-08-20 13:57:26,091][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:57:27,099][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:57:27,101][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:57:28,626][__main__][INFO] - Iteration 370 took 54s (37.28% Gen, 62.71% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 27m 40s. Estimated total time: 15h 14m 38s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 19s.
[2025-08-20 13:57:28,628][__main__][INFO] - Starting iteration 370.
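
Each iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." records followed by a single "Apply reinforce step", which is the classic gradient-accumulation pattern: backpropagate a scaled loss per fixed-size token chunk, then take one optimizer step over the summed gradients. The toy sketch below shows that pattern with a stand-in linear policy; the chunk size and chunk count match the log, while the model, features, and advantages are purely illustrative.

    import torch
    import torch.nn as nn

    vocab, chunk_tokens, n_chunks = 32, 8960, 32   # 32 chunks of 8960 tokens, as logged
    policy = nn.Linear(16, vocab)                  # stand-in for the adapter-wrapped LLM
    opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)

    opt.zero_grad(set_to_none=True)
    for _ in range(n_chunks):
        feats = torch.randn(chunk_tokens, 16)
        actions = torch.randint(vocab, (chunk_tokens,))
        adv = torch.randn(chunk_tokens)                      # per-token advantage weights
        logp = policy(feats).log_softmax(-1)
        logp_a = logp[torch.arange(chunk_tokens), actions]   # log pi(a_t | s_t)
        loss = -(adv * logp_a).mean() / n_chunks             # scale so chunks average out
        loss.backward()                                      # gradients accumulate in .grad
        print(f"Accumulated the policy gradient loss for {chunk_tokens} tokens.")
    opt.step()                                               # the single "reinforce step"

Dividing each chunk loss by n_chunks keeps the accumulated gradient equal to the mean over the full batch, so the effective batch is 32 × 8960 = 286,720 tokens per update without ever materializing it at once; this is consistent with the near-zero ΔVRAM of the reinforce step.
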
[2025-08-20 13:57:52,353][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:57:52,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:57:52,361][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:57:54,822][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:57:54,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:57:54,830][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:57:54,832][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:57:54,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:57:55,132 … 13:58:20,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical records in this interval)
[2025-08-20 13:58:21,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:58:22,953][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:58:22,955][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:58:24,368][__main__][INFO] - Iteration 371 took 55s (38.17% Gen, 61.83% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 41m 6s. Estimated total time: 15h 28m 59s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 53s, 500 more iterations: 7h 44m 29s.
[2025-08-20 13:58:24,369][__main__][INFO] - Starting iteration 371.
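
"Sharing advantage alignment data" / "Receiving advantage packets" indicates that the two trainers exchange per-step advantage estimates so each policy update can condition on its co-player's advantages. The exact estimator inside mllm.training.trainer_ad_align is not visible in this log; the sketch below only illustrates the general flavor of an advantage-alignment weighting from the opponent-shaping literature (own advantage plus a term coupling the agent's past advantages with the co-player's current one). The function name, tensor shapes, and the beta coefficient are all assumptions.

    import torch

    def advantage_alignment_weights(own_adv: torch.Tensor,
                                    opp_adv: torch.Tensor,
                                    beta: float = 1.0) -> torch.Tensor:
        # Illustrative only: the standard REINFORCE term (own advantage) plus an
        # alignment term coupling the running sum of the agent's past advantages
        # with the co-player's current advantage, so actions whose history also
        # benefited the other player get reinforced together.
        past = torch.cumsum(own_adv, dim=-1) - own_adv   # strictly-past prefix sums
        return own_adv + beta * past * opp_adv

    own = torch.randn(4, 8)   # (batch, turns) advantages from this trainer's critic
    opp = torch.randn(4, 8)   # advantages unpacked from the received packet
    weights = advantage_alignment_weights(own, opp)      # feeds the weighted log-prob loss

The "Create alternative trajectory batch" / "Compute alternative advantage estimates" pair appearing twice per iteration is consistent with each of the two adapters (agent and critic sides of the game) preparing counterfactual advantage estimates before the exchange, though the log does not state this explicitly.
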
[2025-08-20 13:58:47,488][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:58:47,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:58:47,496][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:58:49,967][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:58:49,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:58:49,974][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:58:49,976][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:58:49,977][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:58:50,275 … 13:59:15,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical records in this interval)
[2025-08-20 13:59:17,164][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 13:59:18,114][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 13:59:18,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 13:59:19,440][__main__][INFO] - Iteration 372 took 55s (37.52% Gen, 62.48% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 29m 2s. Estimated total time: 15h 17m 50s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 55s.
[2025-08-20 13:59:19,442][__main__][INFO] - Starting iteration 372.
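
After every update, both optimizer states are written to the run directory so a preempted job can resume mid-training with its optimizer moments and step counters intact. A minimal version of that save step might look like the following; the function name save_optimizer_states and the pairing of "main" with the policy optimizer are assumptions read off the log, while torch.save on state_dict() is the standard PyTorch mechanism.

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
        # Persist both optimizers so a resumed run continues the same trajectory.
        policy_path = f"{out_dir}/policy_optimizer_state.pt"
        critic_path = f"{out_dir}/critic_optimizer_state.pt"
        torch.save(policy_opt.state_dict(), policy_path)   # moments, step counts, etc.
        torch.save(critic_opt.state_dict(), critic_path)
        logger.info(f"Saved main optimizer state to {policy_path}")
        logger.info(f"Saved critic optimizer state to {critic_path}")

Saving after each ~55 s iteration costs about a second here (13:59:17 to 13:59:18), a small price for being able to restart a 15-hour run from the last completed step.
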
[2025-08-20 13:59:42,504][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:59:42,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:59:42,512][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:59:44,962][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:59:44,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:59:44,970][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 13:59:44,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 13:59:44,972][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 13:59:45,270 … 14:00:10,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical records in this interval)
[2025-08-20 14:00:12,028][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:00:13,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:00:13,062][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:00:14,483][__main__][INFO] - Iteration 373 took 55s (37.43% Gen, 62.57% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 27m 36s. Estimated total time: 15h 17m 20s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 44s, 500 more iterations: 7h 38m 40s.
[2025-08-20 14:00:14,484][__main__][INFO] - Starting iteration 373.
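
The per-iteration summary splits wall time into generation and training and extrapolates linearly from the average iteration duration to produce the remaining-time figure and the 10/100/500-iteration estimates. (Note the two counters differ by one: "Starting iteration N" uses a zero-based index, while the completion summary that follows it reports N+1, which is why an iteration started as 372 is summarized as "Iteration 373".) A simplified version of the arithmetic, leaving aside the exact h/m/s formatting and assuming a running average and a known total iteration count, is:

    def eta_summary(it, total_its, gen_s, train_s, avg_iter_s):
        took = gen_s + train_s
        gen_pct = 100.0 * gen_s / took
        remaining_s = avg_iter_s * (total_its - it - 1)   # linear extrapolation
        parts = [f"Iteration {it} took {took:.0f}s "
                 f"({gen_pct:.2f}% Gen, {100.0 - gen_pct:.2f}% Train). "
                 f"Estimated remaining time: {remaining_s / 3600.0:.2f} h."]
        for n in (10, 100, 500):
            parts.append(f"{n} more iterations: {n * avg_iter_s / 60.0:.1f} m.")
        return " ".join(parts)

    print(eta_summary(373, 1000, 20, 34, 54.7))   # toy numbers in the log's ballpark

The 1000-iteration total is an assumption for the example, though it is consistent with the log: roughly 15 h estimated total at 55 s per iteration works out to about a thousand iterations.
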
[2025-08-20 14:00:37,541][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:00:37,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:00:37,549][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:00:40,013][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:00:40,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:00:40,020][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:00:40,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:00:40,023][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:00:40,322 … 14:01:05,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical records in this interval)
[2025-08-20 14:01:07,047][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:01:07,982][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:01:07,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:01:09,355][__main__][INFO] - Iteration 374 took 54s (37.53% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 23m 51s. Estimated total time: 15h 14m 30s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 15s.
[2025-08-20 14:01:09,356][__main__][INFO] - Starting iteration 374.
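
In equation form, the quantity accumulated across the 32 chunks before each reinforce step is, assuming a token-level REINFORCE surrogate with per-token weights w_t (which this log does not spell out):

\[
\mathcal{L}(\theta) \;=\; -\frac{1}{N}\sum_{i=1}^{32}\;\sum_{t \in \mathrm{chunk}_i} w_t \,\log \pi_\theta\!\left(a_t \mid s_{<t}\right),
\qquad N = 32 \times 8960,
\]

where w_t would be the per-token advantage (or advantage-alignment) weight. Each "Accumulated the policy gradient loss" record contributes one inner sum, and the "Apply reinforce step" record is the single gradient step on \(\nabla_\theta \mathcal{L}\).
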
[2025-08-20 14:01:32,443][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:01:32,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:01:32,450][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:01:34,922][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:01:34,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:01:34,930][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:01:34,932][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:01:34,932][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:01:35,233 … 14:02:00,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical records in this interval)
[2025-08-20 14:02:02,004][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:02:02,946][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:02:02,947][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:02:04,222][__main__][INFO] - Iteration 375 took 54s (37.60% Gen, 62.40% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 22m 52s. Estimated total time: 15h 14m 25s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 26s, 500 more iterations: 7h 37m 12s.
[2025-08-20 14:02:04,224][__main__][INFO] - Starting iteration 375.
[2025-08-20 14:02:27,876][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:02:27,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:02:27,883][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:02:30,326][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:02:30,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:02:30,334][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:02:30,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:02:30,337][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:02:30,635 … 14:02:55,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical records in this interval)
[2025-08-20 14:02:57,297][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:02:58,251][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:02:58,252][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:02:59,651][__main__][INFO] - Iteration 376 took 55s (38.27% Gen, 61.72% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 31m 18s. Estimated total time: 15h 23m 47s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 53s.
[2025-08-20 14:02:59,653][__main__][INFO] - Starting iteration 376.
[2025-08-20 14:03:22,940][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:03:22,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:03:22,947][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:03:25,403][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:03:25,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:03:25,410][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:03:25,412][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:03:25,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:03:25,712 … 14:03:50,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical records in this interval)
[2025-08-20 14:03:52,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:03:53,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:03:53,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:03:54,752][__main__][INFO] - Iteration 377 took 55s (37.80% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 24m 55s. Estimated total time: 15h 18m 19s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 9s.
[2025-08-20 14:03:54,757][__main__][INFO] - Starting iteration 377.
[2025-08-20 14:04:17,769][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:04:17,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:04:17,777][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:04:20,206][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:04:20,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:04:20,213][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:04:20,215][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:04:20,216][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:04:20,513 … 14:04:45,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 identical records in this interval)
[2025-08-20 14:04:47,288][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:04:48,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:04:48,212][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:04:49,591][__main__][INFO] - Iteration 378 took 54s (37.53% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 19m 35s. Estimated total time: 15h 13m 53s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 23s, 500 more iterations: 7h 36m 56s.
[2025-08-20 14:04:49,593][__main__][INFO] - Starting iteration 378.
[2025-08-20 14:05:12,549][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:05:12,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:05:12,557][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:05:15,018][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:05:15,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:05:15,026][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:05:15,028][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:05:15,029][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:05:15,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:16,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:16,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:17,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:18,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:19,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:20,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:20,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:21,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:22,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:23,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:24,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:24,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:25,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:26,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:27,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:28,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:29,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:30,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:30,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:31,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:32,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:33,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:34,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:34,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:35,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:36,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:37,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:37,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:38,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:39,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:40,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:05:42,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:05:42,973][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:05:42,974][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:05:44,540][__main__][INFO] - Iteration 379 took 54s (37.31% Gen, 62.69% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 20m 33s. Estimated total time: 15h 15m 47s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 34s, 500 more iterations: 7h 37m 53s.
[2025-08-20 14:05:44,542][__main__][INFO] - Starting iteration 379.
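The recurring "For task: … ΔVRAM … ΔTime …" lines report per-task deltas in allocated and reserved CUDA memory. A minimal sketch of such a tracker, assuming PyTorch's torch.cuda memory counters (the TrackUsage name and the exact percentage formula are assumptions, not the actual mllm implementation):

```python
import time
import torch

class TrackUsage:
    """Context manager that logs VRAM/time deltas around a task,
    in the style of the 'For task: ...' lines above (a sketch only)."""

    GIB = 1024 ** 3

    def __init__(self, task: str):
        self.task = task

    def __enter__(self):
        self.t0 = time.monotonic()
        self.alloc0 = torch.cuda.memory_allocated()
        self.reserved0 = torch.cuda.memory_reserved()
        return self

    def __exit__(self, *exc):
        total = torch.cuda.get_device_properties(0).total_memory
        dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - self.t0))
        print(
            f"For task: {self.task} , "
            f"ΔVRAM Allocated: {(torch.cuda.memory_allocated() - self.alloc0) / self.GIB} GB, "
            f"ΔVRAM Reserved: {(torch.cuda.memory_reserved() - self.reserved0) / self.GIB} GB, "
            f"ΔTime: {dt}, "
            # assumption: the percentage is reserved memory over device capacity
            f"Percentage of VRAM taken: {100 * torch.cuda.memory_reserved() / total}%,"
        )

# usage (requires a CUDA device; the task body is hypothetical):
# with TrackUsage("Create alternative trajectory batch"):
#     build_alternative_trajectories()
```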
[2025-08-20 14:06:07,613][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:06:07,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:06:07,621][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:06:10,060][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:06:10,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:06:10,068][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:06:10,070][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:06:10,071][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:06:10,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:11,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:11,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:12,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:13,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:14,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:15,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:15,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:16,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:17,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:18,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:19,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:19,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:20,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:21,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:22,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:23,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:23,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:25,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:25,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:26,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:27,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:28,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:29,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:29,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:30,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:31,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:32,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:33,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:33,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:34,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:35,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:06:37,045][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:06:38,021][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:06:38,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:06:39,462][__main__][INFO] - Iteration 380 took 54s (37.57% Gen, 62.43% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 19m 11s. Estimated total time: 15h 15m 20s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 40s.
[2025-08-20 14:06:39,470][__main__][INFO] - Starting iteration 380.
[2025-08-20 14:07:02,853][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:07:02,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:07:02,861][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:07:05,332][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:07:05,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:07:05,340][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:07:05,342][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:07:05,343][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:07:05,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:06,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:07,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:08,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:08,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:09,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:10,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:11,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:11,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:12,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:13,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:14,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:15,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:15,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:16,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:17,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:18,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:19,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:19,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:20,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:21,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:22,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:23,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:24,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:25,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:26,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:26,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:27,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:28,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:29,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:30,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:30,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:07:32,426][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:07:33,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:07:33,336][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:07:34,621][__main__][INFO] - Iteration 381 took 55s (37.92% Gen, 62.08% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 22m 7s. Estimated total time: 15h 19m 11s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 35s.
[2025-08-20 14:07:34,623][__main__][INFO] - Starting iteration 381.
[2025-08-20 14:07:57,593][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:07:57,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:07:57,600][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:00,048][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:00,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:00,056][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:00,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:08:00,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:08:00,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:01,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:01,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:02,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:03,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:04,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:05,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:05,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:06,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:07,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:08,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:09,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:09,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:10,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:11,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:12,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:13,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:13,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:14,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:15,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:16,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:17,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:18,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:19,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:19,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:20,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:21,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:22,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:23,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:23,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:24,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:25,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:27,017][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:08:27,982][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:08:27,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:08:29,348][__main__][INFO] - Iteration 382 took 54s (37.51% Gen, 62.49% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 14m 6s. Estimated total time: 15h 12m 4s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 12s, 500 more iterations: 7h 36m 2s.
[2025-08-20 14:08:29,349][__main__][INFO] - Starting iteration 382.
[2025-08-20 14:08:52,323][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:52,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:52,331][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:54,797][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:54,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:54,805][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:08:54,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:08:54,808][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:08:55,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:55,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:56,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:57,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:58,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:59,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:08:59,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:00,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:01,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:02,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:03,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:03,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:04,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:05,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:06,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:07,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:07,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:08,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:09,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:10,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:10,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:11,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:12,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:13,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:14,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:15,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:16,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:17,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:17,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:18,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:19,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:20,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:21,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:09:22,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:09:22,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:09:24,322][__main__][INFO] - Iteration 383 took 54s (37.31% Gen, 62.69% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 17m 19s. Estimated total time: 15h 16m 12s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 6s.
[2025-08-20 14:09:24,324][__main__][INFO] - Starting iteration 383.
[2025-08-20 14:09:47,348][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:09:47,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:09:47,355][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:09:49,822][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:09:49,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:09:49,830][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:09:49,832][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:09:49,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:09:50,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:50,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:51,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:52,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:53,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:54,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:54,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:55,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:56,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:57,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:58,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:58,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:09:59,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:00,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:01,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:02,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:02,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:03,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:04,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:05,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:06,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:07,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:08,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:08,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:09,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:10,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:11,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:12,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:12,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:13,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:14,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:15,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:16,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:10:17,949][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:10:17,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:10:19,376][__main__][INFO] - Iteration 384 took 55s (37.41% Gen, 62.58% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 17m 43s. Estimated total time: 15h 17m 31s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 45s.
[2025-08-20 14:10:19,377][__main__][INFO] - Starting iteration 384.
[2025-08-20 14:10:42,518][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:10:42,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:10:42,526][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:10:45,014][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:10:45,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:10:45,022][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:10:45,024][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:10:45,025][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:10:45,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:46,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:46,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:47,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:48,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:49,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:50,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:50,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:51,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:52,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:53,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:54,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:54,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:55,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:56,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:57,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:58,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:10:59,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:00,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:00,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:01,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:02,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:03,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:04,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:04,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:05,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:06,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:07,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:08,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:08,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:09,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:10,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:12,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:11:12,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:11:12,959][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:11:14,408][__main__][INFO] - Iteration 385 took 55s (37.58% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 16m 26s. Estimated total time: 15h 17m 10s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 35s.
[2025-08-20 14:11:14,410][__main__][INFO] - Starting iteration 385.
[2025-08-20 14:11:37,443][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:11:37,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:11:37,450][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:11:39,901][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:11:39,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:11:39,909][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:11:39,911][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:11:39,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:11:40,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:41,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:41,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:42,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:43,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:44,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:44,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:45,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:46,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:47,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:48,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:48,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:49,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:50,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:51,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:52,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:52,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:53,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:54,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:55,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:56,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:57,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:58,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:58,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:11:59,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:00,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:01,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:02,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:02,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:03,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:04,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:05,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:06,967][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:12:07,996][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:12:07,998][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:12:09,494][__main__][INFO] - Iteration 386 took 55s (37.36% Gen, 62.64% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 16m 25s. Estimated total time: 15h 18m 4s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 2s.
[2025-08-20 14:12:09,496][__main__][INFO] - Starting iteration 386.
[2025-08-20 14:12:33,004][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:12:33,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:12:33,012][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:12:35,439][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:12:35,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:12:35,447][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:12:35,450][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:12:35,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:12:35,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:36,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:37,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:38,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:38,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:39,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:40,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:41,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:42,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:42,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:43,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:44,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:45,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:46,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:46,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:47,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:48,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:49,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:50,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:51,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:52,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:52,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:53,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:54,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:55,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:56,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:56,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:57,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:58,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:12:59,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:00,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:00,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:02,473][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:13:03,486][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:13:03,489][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:13:04,837][__main__][INFO] - Iteration 387 took 55s (38.10% Gen, 61.90% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 19m 46s. Estimated total time: 15h 22m 20s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 10s.
[2025-08-20 14:13:04,839][__main__][INFO] - Starting iteration 387.
[2025-08-20 14:13:27,897][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:13:27,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:13:27,904][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:13:30,369][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:13:30,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:13:30,377][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:13:30,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:13:30,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:13:30,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:31,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:32,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:33,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:33,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:34,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:35,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:36,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:37,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:37,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:38,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:39,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:40,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:41,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:41,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:42,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:43,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:44,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:44,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:45,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:46,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:47,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:48,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:48,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:50,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:50,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:51,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:52,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:53,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:54,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:54,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:55,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:13:57,403][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:13:58,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:13:58,449][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:14:01,938][__main__][INFO] - Iteration 388 took 57s (36.08% Gen, 63.92% Train). Generation: 20s, Training: 36s. Estimated remaining time: 9h 48m 8s. Estimated total time: 15h 51m 39s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 9s, 500 more iterations: 7h 55m 49s.
[2025-08-20 14:14:01,940][__main__][INFO] - Starting iteration 388.
[2025-08-20 14:14:26,634][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:14:26,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:14:26,642][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:14:29,097][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:14:29,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:14:29,105][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:14:29,107][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:14:29,108][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:14:29,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:30,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:30,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:31,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:32,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:33,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:34,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:34,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:35,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:36,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:37,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:38,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:38,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:39,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:40,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:41,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:42,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:43,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:44,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:45,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:45,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:46,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:47,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:48,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:48,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:49,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:50,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:51,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:52,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:52,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:53,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:54,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:14:56,211][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:14:57,239][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:14:57,242][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:14:58,640][__main__][INFO] - Iteration 389 took 56s (39.19% Gen, 60.81% Train). Generation: 22s, Training: 34s. Estimated remaining time: 9h 40m 31s. Estimated total time: 15h 44m 59s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 29s, 500 more iterations: 7h 52m 29s.
[2025-08-20 14:14:58,641][__main__][INFO] - Starting iteration 389.
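Each iteration accumulates the policy-gradient loss over 32 microbatches of exactly 8960 tokens before a single optimizer step. A minimal sketch of one such token-level REINFORCE accumulation step, assuming PyTorch; the tensor layout, key names, and the source of the per-token advantages are illustrative:

    import torch

    def accumulate_policy_gradient_loss(model, microbatch, grad_accum_steps=32):
        """One accumulation step: advantage-weighted negative log-prob, averaged over tokens."""
        out = model(microbatch["input_ids"], attention_mask=microbatch["attention_mask"])
        logprobs = torch.log_softmax(out.logits[:, :-1], dim=-1)
        # Log-prob of each next token the policy actually produced.
        taken = logprobs.gather(-1, microbatch["input_ids"][:, 1:].unsqueeze(-1)).squeeze(-1)
        mask = microbatch["action_mask"][:, 1:]      # 1 on generated tokens only
        n_tokens = int(mask.sum())                   # 8960 in every step logged above
        loss = -(taken * microbatch["advantages"][:, 1:] * mask).sum() / n_tokens
        (loss / grad_accum_steps).backward()         # gradients accumulate across the 32 calls
        return n_tokens

The fixed 8960-token count per message suggests the batch is packed into equal-sized microbatches, which keeps the per-step VRAM footprint constant and matches the flat ~69.9% reservation in the task lines.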
[2025-08-20 14:15:21,650][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:15:21,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:15:21,658][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:15:24,123][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:15:24,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:15:24,130][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:15:24,132][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:15:24,133][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:15:24,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:25,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:26,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:26,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:27,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:28,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:29,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:29,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:30,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:31,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:32,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:33,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:33,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:34,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:35,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:36,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:37,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:38,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:39,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:39,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:40,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:41,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:42,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:43,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:43,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:44,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:45,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:46,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:47,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:47,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:48,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:49,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:15:51,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:15:52,237][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:15:52,239][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:15:54,059][__main__][INFO] - Iteration 390 took 55s (37.10% Gen, 62.90% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 18m 14s. Estimated total time: 15h 23m 37s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 48s.
[2025-08-20 14:15:54,061][__main__][INFO] - Starting iteration 390.
[2025-08-20 14:16:17,059][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:16:17,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:16:17,066][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:16:19,525][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:16:19,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:16:19,533][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:16:19,535][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:16:19,536][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:16:19,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:20,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:21,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:22,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:23,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:23,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:24,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:25,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:26,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:26,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:27,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:28,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:29,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:30,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:30,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:31,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:32,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:33,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:34,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:35,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:36,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:37,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:37,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:38,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:39,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:40,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:40,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:41,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:42,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:43,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:44,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:44,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:16:46,584][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:16:47,461][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:16:47,463][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:16:48,950][__main__][INFO] - Iteration 391 took 54s (37.43% Gen, 62.57% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 8m 31s. Estimated total time: 15h 14m 49s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 24s.
[2025-08-20 14:16:48,952][__main__][INFO] - Starting iteration 391.
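After the 32 accumulation messages, "Apply reinforce step" consumes the accumulated gradients in one optimizer step, and both optimizer states are checkpointed to the .pt paths shown above. A sketch of that tail of the iteration, assuming PyTorch; the gradient clipping and the save_dir variable are assumptions, only the two checkpoint file names come from the log:

    import torch

    def apply_reinforce_step(model, policy_optimizer, critic_optimizer, save_dir,
                             max_grad_norm=1.0):
        # One step over the gradients accumulated across the 32 microbatches.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # assumed, not logged
        policy_optimizer.step()
        policy_optimizer.zero_grad(set_to_none=True)
        # Persist optimizer states so an interrupted run can resume mid-training.
        torch.save(policy_optimizer.state_dict(), f"{save_dir}/policy_optimizer_state.pt")
        torch.save(critic_optimizer.state_dict(), f"{save_dir}/critic_optimizer_state.pt")

The tiny 2.288818359375e-05 GB (24 KiB) allocation delta on this step is consistent with optimizer bookkeeping tensors that persist across the probe, while the reserved pool stays flat.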
[2025-08-20 14:17:12,276][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:17:12,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:17:12,283][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:17:14,740][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:17:14,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:17:14,748][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:17:14,750][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:17:14,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:17:15,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:15,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:16,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:17,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:18,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:19,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:19,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:20,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:21,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:22,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:22,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:23,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:24,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:25,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:26,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:26,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:27,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:28,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:29,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:30,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:30,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:31,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:32,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:33,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:34,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:34,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:35,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:36,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:37,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:38,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:39,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:40,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:17:41,827][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:17:42,749][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:17:42,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:17:44,021][__main__][INFO] - Iteration 392 took 55s (37.92% Gen, 62.07% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 10m 36s. Estimated total time: 15h 17m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 54s.
[2025-08-20 14:17:44,023][__main__][INFO] - Starting iteration 392.
[2025-08-20 14:18:06,913][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:18:06,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:18:06,920][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:18:09,361][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:18:09,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:18:09,369][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:18:09,371][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:18:09,372][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:18:09,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:10,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:11,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:12,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:12,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:13,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:14,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:15,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:16,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:16,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:17,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:18,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:19,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:19,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:20,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:21,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:22,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:23,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:23,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:25,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:26,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:26,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:27,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:28,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:29,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:29,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:30,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:31,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:32,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:33,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:33,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:34,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:18:36,416][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:18:37,355][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:18:37,357][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:18:38,782][__main__][INFO] - Iteration 393 took 54s (37.35% Gen, 62.65% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 4m 30s. Estimated total time: 15h 12m 38s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 19s.
[2025-08-20 14:18:38,783][__main__][INFO] - Starting iteration 393.
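The per-iteration summaries are linear extrapolation from the most recent iteration time, kept at sub-second precision: iteration 393's ~54.75s gives 10 x 54.75s ≈ 9m 7s and 100 x 54.75s ≈ 1h 31m 15s, and the remaining/total figures are consistent with a run of roughly 1000 iterations. A sketch of that bookkeeping; function and variable names are illustrative:

    def eta_report(iter_seconds: float, iters_done: int, iters_total: int) -> str:
        def fmt(s: float) -> str:
            h, rem = divmod(int(s), 3600)
            m, sec = divmod(rem, 60)
            return f"{h}h {m}m {sec}s" if h else f"{m}m {sec}s"
        remaining = (iters_total - iters_done) * iter_seconds
        return (f"Estimated remaining time: {fmt(remaining)}. "
                f"Estimated total time: {fmt(iters_total * iter_seconds)}. "
                f"Time estimates for 10 more iterations: {fmt(10 * iter_seconds)}, "
                f"100 more iterations: {fmt(100 * iter_seconds)}, "
                f"500 more iterations: {fmt(500 * iter_seconds)}.")

    # eta_report(54.75, 394, 1000) approximately reproduces iteration 393's figures.

Because the estimate is driven by the latest iteration alone, the projected total drifts between roughly 15h 11m and 15h 52m across adjacent iterations even though throughput is essentially constant.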
[2025-08-20 14:19:01,670][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:01,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:01,678][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:04,133][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:04,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:04,141][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:04,143][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:19:04,143][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:19:04,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:05,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:06,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:06,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:07,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:08,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:09,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:09,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:10,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:11,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:12,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:13,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:13,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:14,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:15,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:16,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:17,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:17,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:18,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:19,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:20,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:21,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:22,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:23,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:23,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:24,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:25,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:26,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:27,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:27,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:28,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:29,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:31,156][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:19:32,088][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:19:32,089][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:19:33,458][__main__][INFO] - Iteration 394 took 54s (37.40% Gen, 62.60% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 2m 11s. Estimated total time: 15h 11m 14s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 7s, 500 more iterations: 7h 35m 37s.
[2025-08-20 14:19:33,459][__main__][INFO] - Starting iteration 394.
[2025-08-20 14:19:56,351][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:56,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:56,358][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:58,810][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:58,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:58,818][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:19:58,820][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:19:58,821][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:19:59,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:19:59,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:00,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:01,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:02,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:03,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:03,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:04,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:05,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:06,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:07,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:07,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:08,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:09,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:10,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:11,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:11,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:12,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:13,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:14,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:14,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:16,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:17,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:17,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:18,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:19,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:20,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:21,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:21,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:22,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:23,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:24,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:25,884][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:20:26,837][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:20:26,839][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:20:28,469][__main__][INFO] - Iteration 395 took 55s (37.19% Gen, 62.81% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 6m 52s. Estimated total time: 15h 16m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 24s.
[2025-08-20 14:20:28,471][__main__][INFO] - Starting iteration 395.
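"Sharing advantage alignment data" immediately followed by "Receiving advantage packets" suggests the two agents' trainers swap per-trajectory advantage estimates so each policy update can be weighted by the other agent's advantages. The transport is not visible in the log; the sketch below is purely illustrative (dataclass, queue pair, and all names are assumptions), showing one symmetric exchange:

    from dataclasses import dataclass
    from multiprocessing import Queue

    import torch

    @dataclass
    class AdvantagePacket:
        """Hypothetical wire format for one batch of advantage estimates."""
        agent_name: str
        advantages: torch.Tensor  # [batch, turns]

    def exchange_advantages(own: AdvantagePacket,
                            send_q: Queue, recv_q: Queue) -> AdvantagePacket:
        send_q.put(own)      # "Sharing advantage alignment data."
        return recv_q.get()  # "Receiving advantage packets."

The sub-millisecond gap between the two messages in every iteration indicates the exchange is effectively non-blocking here, with the opposing trainer's packet already queued by the time this trainer asks for it.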
[2025-08-20 14:20:51,438][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:20:51,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:20:51,445][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:20:53,896][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:20:53,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:20:53,904][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 14:20:53,906][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:20:53,907][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:20:54,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:54,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:55,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:56,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:57,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:58,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:58,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:20:59,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:00,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:01,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:02,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:02,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:03,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:04,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:05,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:06,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:06,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:07,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:08,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:09,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:10,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:11,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:12,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:12,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:13,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:14,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:15,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:16,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:16,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:17,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:18,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:19,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:21:20,897][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 14:21:21,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:21:21,909][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:21:23,274][__main__][INFO] - Iteration 396 took 54s (37.45% Gen, 62.55% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 2m 30s. Estimated total time: 15h 13m 22s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 41s.
[2025-08-20 14:21:23,275][__main__][INFO] - Starting iteration 396.
[2025-08-20 14:21:46,606][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:21:46,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:21:46,614][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[... the same three "For task" entries repeat at 14:21:49,080 – 14:21:49,088, all with ΔVRAM 0.0 GB and ΔTime 00:00:00 ...]
[2025-08-20 14:21:49,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:21:49,090][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:21:49,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens" entries, 14:21:50,186 – 14:22:13,654, omitted ...]
[2025-08-20 14:22:14,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:22:16,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:22:17,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:22:17,037][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:22:18,422][__main__][INFO] - Iteration 397 took 55s (37.81% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 7m 18s. Estimated total time: 15h 19m 6s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 33s.
[2025-08-20 14:22:18,424][__main__][INFO] - Starting iteration 397.
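The paired "Sharing advantage alignment data." / "Receiving advantage packets." entries mark the hand-off in which each agent's trainer publishes its advantage estimates so the opponent can use them when weighting its own policy gradient, which is the central idea of advantage alignment. A very loose sketch of that exchange follows; the queue transport and every name here are assumptions (the real trainers may use any IPC or in-process mechanism):

```python
# Loose sketch of the advantage-packet hand-off between the two trainers.
# All names are illustrative assumptions.
import queue

class AdvantageExchange:
    """One mailbox per direction; both trainers hold a reference."""

    def __init__(self) -> None:
        self._inbox: queue.Queue = queue.Queue()

    def share(self, advantage_packet) -> None:
        # corresponds to "Sharing advantage alignment data."
        self._inbox.put(advantage_packet)

    def receive(self):
        # corresponds to "Receiving advantage packets."; blocks until the
        # opponent has shared. The received opponent advantages then enter
        # the alignment-weighted policy-gradient loss (exact weighting not
        # shown in the log).
        return self._inbox.get()
```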
[2025-08-20 14:22:41,502][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:22:41,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:22:41,510][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[... the same three "For task" entries repeat at 14:22:43,961 – 14:22:43,969, all with ΔVRAM 0.0 GB and ΔTime 00:00:00 ...]
[2025-08-20 14:22:43,971][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:22:43,972][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:22:44,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens" entries, 14:22:45,058 – 14:23:08,733, omitted ...]
[2025-08-20 14:23:09,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:23:11,164][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:23:12,137][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:23:12,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:23:13,465][__main__][INFO] - Iteration 398 took 55s (37.50% Gen, 62.50% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 4m 38s. Estimated total time: 15h 17m 20s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 44s, 500 more iterations: 7h 38m 40s.
[2025-08-20 14:23:13,466][__main__][INFO] - Starting iteration 398.
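Each iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens" entries followed by a single "Apply reinforce step", i.e. gradients are accumulated over 32 micro-batches of 8960 tokens apiece (about 287k tokens) before one optimizer update. A sketch of that loop, with all names below being illustrative assumptions rather than this codebase's API:

```python
# Sketch of the accumulation loop implied by the log: one backward() per
# 8960-token micro-batch, one optimizer.step() per iteration.
import logging

logger = logging.getLogger("mllm.training.trainer_common")

def apply_reinforce_step(policy_logprobs, optimizer, microbatches):
    """policy_logprobs(mb) -> per-token log-probabilities with grad enabled."""
    optimizer.zero_grad(set_to_none=True)
    total_tokens = sum(mb["advantages"].numel() for mb in microbatches)
    for mb in microbatches:
        logprobs = policy_logprobs(mb)  # shape: (tokens,)
        # REINFORCE surrogate: advantage-weighted negative log-likelihood,
        # normalized so the 32 partial losses sum to one full-batch loss.
        loss = -(mb["advantages"] * logprobs).sum() / total_tokens
        loss.backward()  # gradients accumulate across micro-batches
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    logprobs.numel())
    optimizer.step()  # the logged "Apply reinforce step"
```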
[2025-08-20 14:23:36,618][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:23:36,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:23:36,626][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[... the same three "For task" entries repeat at 14:23:39,083 – 14:23:39,091, all with ΔVRAM 0.0 GB and ΔTime 00:00:00 ...]
[2025-08-20 14:23:39,093][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:23:39,094][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:23:39,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens" entries, 14:23:40,182 – 14:24:03,717, omitted ...]
[2025-08-20 14:24:04,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:24:06,191][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:24:07,160][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:24:07,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:24:08,495][__main__][INFO] - Iteration 399 took 55s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 3m 30s. Estimated total time: 15h 17m 8s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 42s, 500 more iterations: 7h 38m 34s.
[2025-08-20 14:24:08,496][__main__][INFO] - Starting iteration 399.
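After every reinforce step, both optimizer states are written to the run's seed_1000/agent_trainer directory. Serializing an optimizer's state_dict() with torch.save is standard PyTorch; the function name and call site in this minimal sketch are assumptions:

```python
# Minimal sketch of the optimizer-state checkpointing logged after each step.
import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def save_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)
```

Checkpointing the optimizer (not just the adapter weights) lets a resumed run keep Adam-style moment estimates instead of restarting them from zero.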
[2025-08-20 14:24:32,972][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:24:32,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:24:32,980][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[... the same three "For task" entries repeat at 14:24:35,446 – 14:24:35,453, all with ΔVRAM 0.0 GB and ΔTime 00:00:00 ...]
[2025-08-20 14:24:35,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:24:35,456][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:24:35,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens" entries, 14:24:36,545 – 14:25:00,110, omitted ...]
[2025-08-20 14:25:00,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:25:02,515][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:25:03,453][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:25:03,454][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:25:05,205][__main__][INFO] - Iteration 400 took 56s (38.82% Gen, 61.18% Train). Generation: 22s, Training: 34s. Estimated remaining time: 9h 30m 33s. Estimated total time: 15h 45m 7s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 30s, 500 more iterations: 7h 52m 33s.
[2025-08-20 14:25:05,206][__main__][INFO] - Starting iteration 400.
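Note how the estimates react to a single slow iteration: iteration 400 took 56s (22s generation instead of the usual 20s) and the remaining-time estimate jumped from about 9h 3m to 9h 30m, suggesting the projection leans heavily on the most recent iteration's duration rather than a long-run mean. A plausible reconstruction of these summary lines follows; the estimator and names are assumptions, since the log does not show the actual scheme:

```python
# Illustrative reconstruction of the per-iteration timing summary.
from datetime import timedelta

def fmt(seconds: float) -> str:
    """e.g. 34233 -> '9h 30m 33s'"""
    td = timedelta(seconds=round(seconds))
    h, rem = divmod(td.days * 86400 + td.seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"

def timing_summary(it, duration, gen_time, train_time, total_iters):
    # Percentages are shares of the full iteration, so they need not sum to
    # 100% when there is other overhead (cf. iteration 401 below).
    return (
        f"Iteration {it} took {round(duration)}s "
        f"({100 * gen_time / duration:.2f}% Gen, "
        f"{100 * train_time / duration:.2f}% Train). "
        f"Generation: {round(gen_time)}s, Training: {round(train_time)}s. "
        f"Estimated remaining time: {fmt(duration * (total_iters - it))}. "
        f"Estimated total time: {fmt(duration * total_iters)}. "
        f"Time estimates for 10 more iterations: {fmt(duration * 10)}, "
        f"100 more iterations: {fmt(duration * 100)}, "
        f"500 more iterations: {fmt(duration * 500)}."
    )
```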
[2025-08-20 14:25:28,307][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:25:28,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:25:28,315][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[... the same three "For task" entries repeat at 14:25:30,768 – 14:25:30,776, all with ΔVRAM 0.0 GB and ΔTime 00:00:00 ...]
[2025-08-20 14:25:30,778][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:25:30,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:25:31,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens" entries, 14:25:31,869 – 14:25:55,425, omitted ...]
[2025-08-20 14:25:56,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:25:57,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:25:58,806][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:25:58,807][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:26:03,261][__main__][INFO] - Iteration 401 took 58s (35.58% Gen, 58.99% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 52m 2s. Estimated total time: 16h 7m 34s. Time estimates for 10 more iterations: 9m 40s, 100 more iterations: 1h 36m 45s, 500 more iterations: 8h 3m 47s.
[2025-08-20 14:26:03,262][__main__][INFO] - Starting iteration 401.
[2025-08-20 14:26:26,602][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:26:26,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:26:26,611][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[... the same three "For task" entries repeat at 14:26:29,062 – 14:26:29,070, all with ΔVRAM 0.0 GB and ΔTime 00:00:00 ...]
[2025-08-20 14:26:29,072][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:26:29,073][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:26:29,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens" entries, 14:26:30,164 – 14:26:53,662, omitted ...]
[2025-08-20 14:26:54,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:26:56,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:26:57,028][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:26:57,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:26:58,433][__main__][INFO] - Iteration 402 took 55s (37.89% Gen, 62.11% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9h 3m 3s. Estimated total time: 15h 19m 30s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 57s, 500 more iterations: 7h 39m 45s.
[2025-08-20 14:26:58,435][__main__][INFO] - Starting iteration 402.
[2025-08-20 14:27:21,373][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:27:21,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:27:21,381][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[... the same three "For task" entries repeat at 14:27:23,858 – 14:27:23,866, all with ΔVRAM 0.0 GB and ΔTime 00:00:00 ...]
[2025-08-20 14:27:23,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:27:23,869][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:27:24,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens" entries, 14:27:24,958 – 14:27:48,499, omitted ...]
[2025-08-20 14:27:49,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:27:50,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:27:51,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:27:51,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:27:53,233][__main__][INFO] - Iteration 403 took 54s (37.38% Gen, 62.62% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 55m 55s. Estimated total time: 15h 13m 17s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 19s, 500 more iterations: 7h 36m 38s.
[2025-08-20 14:27:53,234][__main__][INFO] - Starting iteration 403.
[2025-08-20 14:28:16,033][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:28:16,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:28:16,041][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[... the same three "For task" entries repeat at 14:28:18,504 – 14:28:18,512, all with ΔVRAM 0.0 GB and ΔTime 00:00:00 ...]
[2025-08-20 14:28:18,514][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:28:18,515][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:28:18,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens" entries, 14:28:19,607 – 14:28:43,209, omitted ...]
[2025-08-20 14:28:44,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:28:45,634][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:28:46,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:28:46,576][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:28:48,001][__main__][INFO] - Iteration 404 took 54s (37.16% Gen, 62.84% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 54m 29s. Estimated total time: 15h 12m 46s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 16s, 500 more iterations: 7h 36m 23s.
[2025-08-20 14:28:48,002][__main__][INFO] - Starting iteration 404.
[2025-08-20 14:29:10,966][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:29:10,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:29:10,974][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[... the same three "For task" entries repeat at 14:29:13,431 – 14:29:13,438, all with ΔVRAM 0.0 GB and ΔTime 00:00:00 ...]
[2025-08-20 14:29:13,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:29:13,441][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:29:13,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 further identical "Accumulated the policy gradient loss for 8960 tokens" entries, 14:29:14,530 – 14:29:38,033, omitted ...]
[2025-08-20 14:29:38,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 14:29:40,437][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:29:41,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:29:41,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:29:42,896][__main__][INFO] - Iteration 405 took 54s (37.34% Gen, 62.66% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 55m 40s. Estimated total time: 15h 14m 52s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 26s.
[2025-08-20 14:29:42,897][__main__][INFO] - Starting iteration 405.
[2025-08-20 14:30:05,765][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:30:05,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:30:05,773][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:30:08,225][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:30:08,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:30:08,233][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:30:08,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:30:08,236][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:30:08,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:09,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:10,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:10,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:11,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:12,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:13,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:14,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:14,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:15,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:16,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:17,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:18,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:18,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:19,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:20,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:30:21,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:22,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:22,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:24,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:24,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:25,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:26,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:27,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:28,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:28,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:29,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:30,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:31,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:32,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:32,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:33,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:30:35,241][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:30:36,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:30:36,303][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:30:37,742][__main__][INFO] - Iteration 406 took 54s (37.22% Gen, 62.78% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 53m 57s. Estimated total time: 15h 14m 3s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 1s. [2025-08-20 14:30:37,743][__main__][INFO] - Starting iteration 406. 
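
Each iteration also builds an "alternative trajectory batch" and exchanges advantage estimates with the co-player ("Sharing advantage alignment data" / "Receiving advantage packets"), which is what an advantage-alignment-style opponent-shaping update needs: each agent reweights its own advantages using a discounted history of the other agent's advantages. The sketch below shows one plausible form of that reweighting; the exact coefficients and the packet format are assumptions, not taken from the trainer.

    import torch

    def aligned_advantages(a_self: torch.Tensor, a_other: torch.Tensor,
                           beta: float = 1.0, gamma: float = 0.96) -> torch.Tensor:
        # a_self, a_other: per-step advantages of this agent and the co-player,
        # shape [T] for one episode. The co-player's values arrive in the
        # "advantage packets" exchanged above (format assumed).
        t = torch.arange(a_self.shape[0])
        decay = torch.tril(gamma ** (t.view(-1, 1) - t))  # decay[i, k] = gamma**(i-k) for k <= i
        shaped = decay @ a_other  # discounted causal sum of the co-player's advantages
        return a_self * (1.0 + beta * shaped)

Weighting one's own advantage by the co-player's discounted advantages is what steers the policy toward actions that were good for both agents, which matters in a general-sum game like the iterated prisoner's dilemma this run ("ipd_prev_ad_align_qwen2.5_7b") is training on.
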
[2025-08-20 14:31:00,983][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:00,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:00,990][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:03,449][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:03,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:03,457][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:03,460][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:31:03,460][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:31:03,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:04,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:05,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:06,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:06,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:07,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:08,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:09,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:10,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:10,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:11,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:12,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:13,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:14,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:14,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:15,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:31:16,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:17,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:18,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:18,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:20,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:20,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:21,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:22,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:23,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:24,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:24,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:25,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:26,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:27,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:28,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:28,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:30,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:31:31,512][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:31:31,514][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:31:32,838][__main__][INFO] - Iteration 407 took 55s (37.73% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 57m 12s. Estimated total time: 15h 18m 14s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 7s. [2025-08-20 14:31:32,840][__main__][INFO] - Starting iteration 407. 
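
The "For task: …, ΔVRAM Allocated: …, ΔVRAM Reserved: …, ΔTime: …" entries read like a small context-manager profiler wrapped around each training phase, logging allocator deltas and wall-clock time. A reconstruction using the public torch.cuda counters is sketched below; it is not the mllm source, and the percentage base (reserved rather than allocated memory) is a guess.

    import contextlib
    import logging
    import time
    import torch

    log = logging.getLogger(__name__)

    @contextlib.contextmanager
    def track(task: str, device: int = 0):
        # Log VRAM and wall-clock deltas around a task, mirroring the
        # "For task: ..." entries in this log.
        alloc0 = torch.cuda.memory_allocated(device)
        res0 = torch.cuda.memory_reserved(device)
        t0 = time.monotonic()
        try:
            yield
        finally:
            gib = 1024 ** 3
            total = torch.cuda.get_device_properties(device).total_memory
            dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
            log.info(
                "For task: %s , ΔVRAM Allocated: %s GB, ΔVRAM Reserved: %s GB, "
                "ΔTime: %s, Percentage of VRAM taken: %s%%,",
                task,
                (torch.cuda.memory_allocated(device) - alloc0) / gib,
                (torch.cuda.memory_reserved(device) - res0) / gib,
                dt,
                100 * torch.cuda.memory_reserved(device) / total,
            )

The near-zero deltas and the flat ~69.92% occupancy across iterations indicate the allocator reached steady state early in the run; only the first "Apply reinforce step" (iteration 0) grew the reserved pool.
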
[2025-08-20 14:31:55,833][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:55,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:55,841][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:58,286][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:58,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:58,293][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:31:58,296][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:31:58,296][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:31:58,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:31:59,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:00,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:00,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:01,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:02,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:03,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:04,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:04,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:05,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:06,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:07,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:08,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:08,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:09,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:10,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:32:11,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:12,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:12,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:13,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:14,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:15,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:16,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:17,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:18,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:18,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:19,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:20,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:21,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:22,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:22,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:23,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:25,254][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:32:26,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:32:26,478][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:32:27,931][__main__][INFO] - Iteration 408 took 55s (37.29% Gen, 62.70% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 56m 14s. Estimated total time: 15h 18m 11s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 5s. [2025-08-20 14:32:27,933][__main__][INFO] - Starting iteration 408. 
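
Both optimizer states are rewritten to scratch after every update, which is what makes the run resumable after preemption: Adam-style first/second-moment buffers and step counts live in the optimizer, not the model. A minimal sketch (the directory layout follows the paths in the log; the function name is mine):

    import torch

    def save_optimizer_states(policy_opt: torch.optim.Optimizer,
                              critic_opt: torch.optim.Optimizer,
                              out_dir: str) -> None:
        # Persist per-parameter moment estimates and step counts so a restarted
        # job can resume training without re-warming the optimizer.
        torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
        torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")
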
[2025-08-20 14:32:50,839][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:32:50,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:32:50,847][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:32:53,290][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:32:53,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:32:53,298][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:32:53,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:32:53,301][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:32:53,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:54,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:55,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:55,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:56,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:57,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:58,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:59,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:32:59,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:00,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:01,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:02,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:03,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:03,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:04,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:05,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:33:06,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:07,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:07,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:08,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:09,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:10,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:11,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:11,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:13,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:13,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:14,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:15,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:16,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:17,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:17,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:18,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:20,376][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:33:21,295][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:33:21,297][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:33:22,670][__main__][INFO] - Iteration 409 took 54s (37.40% Gen, 62.60% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 49m 24s. Estimated total time: 15h 12m 16s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 13s, 500 more iterations: 7h 36m 8s. [2025-08-20 14:33:22,671][__main__][INFO] - Starting iteration 409. 
[2025-08-20 14:33:46,657][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:33:46,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:33:46,665][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:33:49,120][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:33:49,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:33:49,128][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:33:49,130][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:33:49,131][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:33:49,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:50,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:51,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:51,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:52,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:53,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:54,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:55,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:56,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:56,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:57,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:33:58,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:00,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:01,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:02,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:03,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:34:04,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:04,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:06,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:07,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:08,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:09,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:10,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:11,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:12,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:12,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:13,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:14,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:15,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:16,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:16,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:17,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:19,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:30, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:34:20,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:34:20,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:34:21,858][__main__][INFO] - Iteration 410 took 59s (36.38% Gen, 63.62% Train). Generation: 21s, Training: 37s. Estimated remaining time: 10h 2m 35s. Estimated total time: 16h 26m 25s. Time estimates for 10 more iterations: 9m 51s, 100 more iterations: 1h 38m 38s, 500 more iterations: 8h 13m 12s. [2025-08-20 14:34:21,859][__main__][INFO] - Starting iteration 410. 
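
The timing summaries are straight-line extrapolations from the mean iteration time: at iteration 405 the "500 more iterations: 7h 37m 26s" figure implies an average of about 54.9 s/iteration, and the "Estimated total time: 15h 14m 52s" then corresponds to a 1000-iteration run (500 × 54.892 s = 27,446 s = 7h 37m 26s; 1000 × 54.892 s = 54,892 s = 15h 14m 52s). The total iteration count is inferred from these figures, not logged. In code:

    def eta_report(iters_done: int, elapsed_s: float, total_iters: int = 1000) -> dict:
        # Straight-line extrapolation from the mean iteration time so far.
        # total_iters=1000 is inferred from the logged totals, not a logged value.
        avg = elapsed_s / max(iters_done, 1)
        return {
            "avg_s_per_iter": avg,
            "remaining_s": avg * (total_iters - iters_done),
            "total_s": avg * total_iters,
            "next_10_s": avg * 10,
            "next_100_s": avg * 100,
            "next_500_s": avg * 500,
        }

This also explains why a single slow iteration (e.g., iteration 410 at 59 s) bumps the estimated total by more than an hour: the extrapolation multiplies any shift in the running average by the full remaining count.
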
[2025-08-20 14:34:44,888][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:34:44,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:34:44,896][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:34:47,368][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:34:47,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:34:47,376][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:34:47,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:34:47,379][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:34:47,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:48,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:49,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:50,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:50,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:51,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:52,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:53,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:54,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:54,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:55,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:56,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:57,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:58,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:34:59,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:00,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:35:00,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:01,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:02,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:03,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:04,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:04,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:05,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:06,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:07,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:08,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:08,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:09,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:10,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:11,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:11,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:12,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:14,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:35:15,269][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:35:15,270][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:35:16,673][__main__][INFO] - Iteration 411 took 54s (37.52% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 48m 47s. Estimated total time: 15h 13m 33s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 21s, 500 more iterations: 7h 36m 46s. [2025-08-20 14:35:16,674][__main__][INFO] - Starting iteration 411. 
[2025-08-20 14:35:40,841][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:35:40,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:35:40,850][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:35:43,322][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:35:43,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:35:43,330][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:35:43,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:35:43,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:35:43,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:44,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:45,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:46,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:46,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:47,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:48,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:49,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:49,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:50,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:51,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:52,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:53,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:53,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:54,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:55,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:35:56,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:57,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:57,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:35:59,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:00,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:00,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:01,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:02,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:03,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:04,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:04,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:05,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:06,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:07,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:08,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:08,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:10,400][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:36:11,304][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:36:11,306][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:36:12,610][__main__][INFO] - Iteration 412 took 55s (38.81% Gen, 61.19% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 6m 30s. Estimated total time: 15h 32m 12s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 13s, 500 more iterations: 7h 46m 6s. [2025-08-20 14:36:12,612][__main__][INFO] - Starting iteration 412. 
[2025-08-20 14:36:35,572][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:36:35,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:36:35,580][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:36:38,011][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:36:38,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:36:38,019][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:36:38,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:36:38,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:36:38,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:39,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:39,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:40,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:41,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:42,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:43,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:43,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:44,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:45,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:46,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:47,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:47,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:48,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:49,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:50,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:36:51,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:51,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:52,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:53,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:54,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:55,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:56,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:57,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:57,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:58,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:36:59,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:00,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:01,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:01,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:02,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:03,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:05,033][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:37:05,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:37:05,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:37:07,719][__main__][INFO] - Iteration 413 took 55s (37.21% Gen, 62.79% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 51m 49s. Estimated total time: 15h 18m 26s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 50s, 500 more iterations: 7h 39m 13s. [2025-08-20 14:37:07,720][__main__][INFO] - Starting iteration 413. 
[2025-08-20 14:37:30,659][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:37:30,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:37:30,666][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:37:33,107][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:37:33,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:37:33,114][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:37:33,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:37:33,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:37:33,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:34,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:35,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:35,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:36,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:37,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:38,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:38,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:39,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:40,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:41,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:42,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:42,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:43,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:44,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:45,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:37:46,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:46,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:47,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:48,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:49,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:50,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:51,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:52,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:52,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:53,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:54,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:55,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:56,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:56,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:57,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:37:58,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:38:00,088][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:38:01,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:38:01,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:38:02,568][__main__][INFO] - Iteration 414 took 54s (37.36% Gen, 62.64% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 46m 35s. Estimated total time: 15h 14m 7s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 3s. [2025-08-20 14:38:02,570][__main__][INFO] - Starting iteration 414. 
[2025-08-20 14:38:26,610 → 14:38:29,085][mllm.training.trainer_ad_align][INFO] - For tasks: Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates (each logged twice), ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:38:29,088][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:38:29,088][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:38:29,390 → 14:38:54,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries)
[2025-08-20 14:38:56,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:38:57,137][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:38:57,139][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:38:58,556][__main__][INFO] - Iteration 415 took 55s (38.55% Gen, 61.45% Train). Generation: 21s, Training: 34s. Estimated remaining time: 9h 4m 38s. Estimated total time: 15h 33m 5s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 32s.
[2025-08-20 14:38:58,557][__main__][INFO] - Starting iteration 415.
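The "For task: ..., ΔVRAM Allocated: ..., ΔTime: ..." entries look like the output of a measurement wrapper around each task. A plausible sketch using PyTorch's real memory counters (torch.cuda.memory_allocated / memory_reserved); the basis of "Percentage of VRAM taken" is not visible in the log, so reserved-over-total is an assumption here, as are the helper's name and signature:

import time
from contextlib import contextmanager
import torch

@contextmanager
def track_task(name, logger, device=0):
    # Snapshot allocator state and wall-clock time before the task runs.
    alloc0 = torch.cuda.memory_allocated(device)
    res0 = torch.cuda.memory_reserved(device)
    t0 = time.monotonic()
    yield
    d_alloc = (torch.cuda.memory_allocated(device) - alloc0) / 2**30  # GiB
    d_res = (torch.cuda.memory_reserved(device) - res0) / 2**30      # GiB
    dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
    total = torch.cuda.get_device_properties(device).total_memory
    pct = 100 * torch.cuda.memory_reserved(device) / total  # assumed: reserved / total
    logger.info(f"For task: {name}, ΔVRAM Allocated: {d_alloc} GB, "
                f"ΔVRAM Reserved: {d_res} GB, ΔTime: {dt}, "
                f"Percentage of VRAM taken: {pct}%")

Used as "with track_task('Apply reinforce step', logger): ...", this would produce lines of exactly the shape seen above; the 0.0 GB deltas on the ad_align tasks then simply mean those steps reuse already-reserved memory.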
[2025-08-20 14:39:21,569 → 14:39:24,042][mllm.training.trainer_ad_align][INFO] - For tasks: Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates (each logged twice), ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:39:24,044][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:39:24,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:39:24,343 → 14:39:49,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries)
[2025-08-20 14:39:51,035][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:39:51,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:39:51,985][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:39:53,429][__main__][INFO] - Iteration 416 took 54s (37.48% Gen, 62.52% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 45m 8s. Estimated total time: 15h 14m 31s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 15s.
[2025-08-20 14:39:53,430][__main__][INFO] - Starting iteration 416.
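The per-iteration summaries extrapolate linearly from an average iteration time: iteration 416 reports 9m 8s for 10 more iterations and 7h 37m 15s for 500 more, i.e. roughly 54.87 s per iteration. The arithmetic, assuming truncation to whole seconds (the code base's exact rounding is not visible in the log):

# Linear extrapolation behind "Time estimates for N more iterations".
avg_s = 54.87  # average seconds per iteration implied by iteration 416's line

def fmt(seconds):
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

for n in (10, 100, 500):
    print(f"{n} more iterations: {fmt(avg_s * n)}")
# -> 10 more iterations: 9m 8s
# -> 100 more iterations: 1h 31m 27s
# -> 500 more iterations: 7h 37m 15s

The small drift in "Estimated total time" between iterations (15h 14m 31s vs 15h 33m 5s one iteration earlier) is consistent with the average being recomputed as each new iteration time arrives.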
[2025-08-20 14:40:16,707 → 14:40:19,176][mllm.training.trainer_ad_align][INFO] - For tasks: Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates (each logged twice), ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:40:19,179][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:40:19,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:40:19,478 → 14:40:44,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries)
[2025-08-20 14:40:46,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:40:47,809][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:40:47,812][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:40:49,373][__main__][INFO] - Iteration 417 took 55s (37.21% Gen, 62.79% Train). Generation: 20s, Training: 35s. Estimated remaining time: 9h 2m 4s. Estimated total time: 15h 32m 22s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 14s, 500 more iterations: 7h 46m 11s.
[2025-08-20 14:40:49,375][__main__][INFO] - Starting iteration 417.
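Both optimizer states are re-saved at the end of every iteration, so a job killed mid-run can resume from the last completed step. The equivalent of the two "Saved ... optimizer state" lines is a pair of torch.save calls on the optimizers' state_dicts; the file names come from the log, while the function and argument names below are illustrative:

import os
import torch

def save_optimizer_states(save_dir, policy_optimizer, critic_optimizer, logger):
    # Overwrite the same two files each iteration; only the latest state is kept.
    os.makedirs(save_dir, exist_ok=True)
    policy_path = os.path.join(save_dir, "policy_optimizer_state.pt")
    critic_path = os.path.join(save_dir, "critic_optimizer_state.pt")
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info(f"Saved main optimizer state to {policy_path}")
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info(f"Saved critic optimizer state to {critic_path}")

Saving only state_dicts (momentum buffers, step counts) rather than whole optimizer objects keeps the checkpoints loadable with optimizer.load_state_dict(torch.load(path)) even if the surrounding code changes.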
[2025-08-20 14:41:13,136 → 14:41:15,627][mllm.training.trainer_ad_align][INFO] - For tasks: Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates (each logged twice), ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:41:15,629][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:41:15,630][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:41:15,930 → 14:41:41,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries)
[2025-08-20 14:41:42,648][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:41:43,584][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:41:43,586][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:41:45,101][__main__][INFO] - Iteration 418 took 55s (38.18% Gen, 61.82% Train). Generation: 21s, Training: 34s. Estimated remaining time: 8h 57m 31s. Estimated total time: 15h 28m 45s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 22s.
[2025-08-20 14:41:45,102][__main__][INFO] - Starting iteration 418.
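The "Sharing advantage alignment data" / "Receiving advantage packets" pair suggests that each agent's trainer exchanges its advantage estimates with the other agent's trainer before the policy-gradient pass, since advantage alignment weights an agent's update using both players' advantages. A schematic of such an exchange over a pair of queues; the packet layout and transport are entirely illustrative, not the actual mllm mechanism:

from dataclasses import dataclass
import torch

@dataclass
class AdvantagePacket:
    # Hypothetical payload: per-timestep advantage estimates for one batch
    # of trajectories, tagged with the sending agent's name.
    sender: str
    advantages: torch.Tensor  # assumed shape: (batch, timesteps)

def exchange_advantages(own_packet, send_queue, recv_queue, logger):
    # Symmetric exchange: publish our advantages, then block until the
    # peer trainer's packet arrives.
    logger.info("Sharing advantage alignment data.")
    send_queue.put(own_packet)
    logger.info("Receiving advantage packets.")
    peer_packet = recv_queue.get()
    return peer_packet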
[2025-08-20 14:42:08,029 → 14:42:10,494][mllm.training.trainer_ad_align][INFO] - For tasks: Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates (each logged twice), ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:42:10,496][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:42:10,497][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:42:10,795 → 14:42:35,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries)
[2025-08-20 14:42:37,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:42:38,363][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:42:38,365][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:42:39,747][__main__][INFO] - Iteration 419 took 54s (37.48% Gen, 62.52% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 38m 34s. Estimated total time: 15h 10m 42s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 4s, 500 more iterations: 7h 35m 21s.
[2025-08-20 14:42:39,749][__main__][INFO] - Starting iteration 419.
[2025-08-20 14:43:02,632 → 14:43:05,096][mllm.training.trainer_ad_align][INFO] - For tasks: Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates (each logged twice), ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:43:05,099][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:43:05,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:43:05,399 → 14:43:30,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries)
[2025-08-20 14:43:32,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:43:32,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:43:33,000][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:43:34,488][__main__][INFO] - Iteration 420 took 54s (37.32% Gen, 62.68% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 39m 15s. Estimated total time: 15h 12m 19s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 13s, 500 more iterations: 7h 36m 9s.
[2025-08-20 14:43:34,490][__main__][INFO] - Starting iteration 420.
[2025-08-20 14:43:57,413 → 14:43:59,907][mllm.training.trainer_ad_align][INFO] - For tasks: Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates (each logged twice), ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:43:59,909][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:43:59,910][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:44:00,210 → 14:44:25,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries)
[2025-08-20 14:44:26,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:44:27,883][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:44:27,884][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:44:29,243][__main__][INFO] - Iteration 421 took 54s (37.36% Gen, 62.64% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 38m 34s. Estimated total time: 15h 12m 32s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 16s.
[2025-08-20 14:44:29,245][__main__][INFO] - Starting iteration 421.
[2025-08-20 14:44:52,542 → 14:44:55,015][mllm.training.trainer_ad_align][INFO] - For tasks: Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates (each logged twice), ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:44:55,018][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:44:55,018][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:44:55,317 → 14:45:20,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries)
[2025-08-20 14:45:22,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:45:23,032][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:45:23,034][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:45:24,448][__main__][INFO] - Iteration 422 took 55s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 45m 9s. Estimated total time: 15h 20m 2s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 1s.
[2025-08-20 14:45:24,449][__main__][INFO] - Starting iteration 422.
[2025-08-20 14:45:47,439 → 14:45:49,921][mllm.training.trainer_ad_align][INFO] - For tasks: Create alternative trajectory batch / Get advantages with critic gradient accumulation / Compute alternative advantage estimates (each logged twice), ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:45:49,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:45:49,924][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:45:50,223 → 14:46:15,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 identical entries)
[2025-08-20 14:46:16,985][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:46:17,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:46:17,920][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:46:19,809][__main__][INFO] - Iteration 423 took 55s (37.06% Gen, 62.94% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 46m 51s. Estimated total time: 15h 22m 39s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 19s.
[2025-08-20 14:46:19,811][__main__][INFO] - Starting iteration 423.
[2025-08-20 14:46:42,849][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:46:42,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:46:42,857][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:46:45,322][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:46:45,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:46:45,330][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:46:45,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:46:45,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:46:45,634 – 14:47:10,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, ~0.8 s apart; collapsed here for readability)
[2025-08-20 14:47:12,470][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:47:13,398][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:47:13,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:47:14,725][__main__][INFO] - Iteration 424 took 54s (37.48% Gen, 62.52% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 38m 29s. Estimated total time: 15h 15m 13s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 36s.
[2025-08-20 14:47:14,726][__main__][INFO] - Starting iteration 424.
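
The update structure visible in each block — 32 "Accumulated the policy gradient loss for 8960 tokens" entries followed by a single "Apply reinforce step" — is the standard gradient-accumulation pattern: backpropagate a scaled loss per micro-batch, then take one optimizer step. An illustrative sketch, assuming a REINFORCE-style advantage-weighted log-probability loss and hypothetical micro-batch fields (the actual loss in trainer_common is not shown in this log):

    import torch

    def reinforce_update(policy, optimizer, micro_batches, logger):
        """One REINFORCE update with gradient accumulation over token micro-batches."""
        optimizer.zero_grad(set_to_none=True)
        n = len(micro_batches)  # 32 in this log
        for mb in micro_batches:
            logp = policy.log_probs(mb.tokens)            # per-token log-probs (hypothetical API)
            loss = -(mb.advantages * logp).sum() / mb.num_tokens
            (loss / n).backward()                          # scale so gradients average over micro-batches
            logger.info("Accumulated the policy gradient loss for %d tokens.", mb.num_tokens)
        torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)  # assumed; clipping is not visible in the log
        optimizer.step()
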
[2025-08-20 14:47:37,863][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:47:37,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:47:37,871][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:47:40,340][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:47:40,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:47:40,347][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:47:40,350][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:47:40,350][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:47:40,648 – 14:48:05,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, ~0.8 s apart; collapsed here for readability)
[2025-08-20 14:48:07,380][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:48:08,287][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:48:08,288][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:48:09,741][__main__][INFO] - Iteration 425 took 55s (37.58% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 39m 16s. Estimated total time: 15h 16m 54s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 27s.
[2025-08-20 14:48:09,743][__main__][INFO] - Starting iteration 425.
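
For scale, the numbers in these blocks pin down the tokens processed per update and the training-side throughput, assuming the 26–27 s ΔTime of "Apply reinforce step" covers the whole accumulation loop:

    micro_batches = 32                 # "Accumulated ..." entries per update
    tokens_per_micro_batch = 8_960
    tokens_per_update = micro_batches * tokens_per_micro_batch   # 286,720 tokens
    update_seconds = 26                # ΔTime of "Apply reinforce step" above (27 s in other iterations)
    print(tokens_per_update / update_seconds)                    # ≈ 11,000 tokens/s during training
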
[2025-08-20 14:48:32,681][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:48:32,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:48:32,688][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:48:35,145][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:48:35,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:48:35,154][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:48:35,156][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:48:35,156][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:48:35,455 – 14:49:00,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, ~0.8 s apart; collapsed here for readability)
[2025-08-20 14:49:02,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:49:03,021][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:49:03,023][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:49:04,494][__main__][INFO] - Iteration 426 took 54s (37.40% Gen, 62.60% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 33m 57s. Estimated total time: 15h 12m 30s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 15s.
[2025-08-20 14:49:04,496][__main__][INFO] - Starting iteration 426.
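
The time estimates in each iteration summary are consistent with straight linear extrapolation from the recent average iteration duration. Checking against iteration 426 above (a sketch; the exact averaging window is not visible in the log):

    avg_iter_s = 54.75                       # implied by "100 more iterations: 1h 31m 15s"
    for n in (10, 100, 500):
        print(n, avg_iter_s * n)             # 547.5 s ≈ 9m 7s; 5475 s = 1h 31m 15s; 27375 s = 7h 36m 15s
    # "Estimated total time" 15h 12m 30s = 54,750 s, i.e. 54,750 / 54.75 = 1000 iterations planned;
    # "Estimated remaining time" 8h 33m 57s = 30,837 s ≈ 563 iterations still to run.
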
[2025-08-20 14:49:27,378][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:49:27,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:49:27,385][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:49:29,833][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:49:29,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:49:29,841][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:49:29,843][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:49:29,843][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:49:30,141 – 14:49:55,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, ~0.8 s apart; collapsed here for readability)
[2025-08-20 14:49:56,904][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:49:57,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:49:57,866][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:49:59,455][__main__][INFO] - Iteration 427 took 54s (37.17% Gen, 62.83% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 36m 30s. Estimated total time: 15h 15m 58s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 59s.
[2025-08-20 14:49:59,457][__main__][INFO] - Starting iteration 427.
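
Both optimizer states are re-saved at the end of every iteration (the two "Saved … optimizer state" lines), which keeps the run resumable at roughly one second of overhead per iteration. A minimal equivalent, with paths taken from the log and an illustrative function name:

    import torch

    def save_trainer_state(policy_optimizer, critic_optimizer, out_dir, logger):
        """Checkpoint both optimizer states every iteration (sketch)."""
        policy_path = f"{out_dir}/policy_optimizer_state.pt"
        critic_path = f"{out_dir}/critic_optimizer_state.pt"
        torch.save(policy_optimizer.state_dict(), policy_path)
        logger.info("Saved main optimizer state to %s", policy_path)
        torch.save(critic_optimizer.state_dict(), critic_path)
        logger.info("Saved critic optimizer state to %s", critic_path)
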
[2025-08-20 14:50:22,623][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:50:22,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:50:22,630][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:50:25,090][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:50:25,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:50:25,098][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:50:25,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:50:25,101][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:50:25,399 – 14:50:50,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, ~0.8 s apart; collapsed here for readability)
[2025-08-20 14:50:52,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:50:53,005][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:50:53,006][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:50:54,471][__main__][INFO] - Iteration 428 took 55s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 36m 30s. Estimated total time: 15h 16m 53s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 26s.
[2025-08-20 14:50:54,472][__main__][INFO] - Starting iteration 428.
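
The "Sharing advantage alignment data" / "Receiving advantage packets" pair indicates that the two trainers exchange advantage estimates each iteration before the policy loss is formed; in advantage-alignment-style opponent shaping, each agent's gradient is weighted using both its own and its opponent's advantages. A sketch of the exchange plumbing only — the transport, packet contents, and ranks are assumptions, and the real mechanism is not visible in the log:

    import torch
    import torch.distributed as dist

    def exchange_advantages(own_advantages: torch.Tensor, peer_rank: int, logger):
        """Swap advantage estimates with the other agent's trainer (illustrative)."""
        logger.info("Sharing advantage alignment data.")
        send_req = dist.isend(own_advantages, dst=peer_rank)    # non-blocking send of our packet
        logger.info("Receiving advantage packets.")
        opponent_advantages = torch.empty_like(own_advantages)
        dist.recv(opponent_advantages, src=peer_rank)           # blocking receive of theirs
        send_req.wait()
        return opponent_advantages
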
[2025-08-20 14:51:18,298][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:51:18,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:51:18,305][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:51:20,775][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:51:20,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:51:20,783][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:51:20,785][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:51:20,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:51:21,085 – 14:51:46,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, ~0.8 s apart; collapsed here for readability)
[2025-08-20 14:51:47,712][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:51:48,628][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:51:48,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:51:50,324][__main__][INFO] - Iteration 429 took 55s (38.25% Gen, 61.75% Train). Generation: 21s, Training: 34s. Estimated remaining time: 8h 49m 32s. Estimated total time: 15h 30m 51s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 25s.
[2025-08-20 14:51:50,326][__main__][INFO] - Starting iteration 429.
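
The three trainer_ad_align tasks at the top of each iteration ("Create alternative trajectory batch", "Get advantages with critic gradient accumulation", "Compute alternative advantage estimates") suggest advantages are computed twice: once for the sampled trajectories and once for an alternative batch used by the alignment term. A hedged sketch of advantage estimation with a learned critic, using a one-step TD form; the actual estimator and critic loss are not shown in the log:

    import torch

    def advantages_with_critic(critic, states, rewards, gamma=0.99):
        """One-step TD advantage: A_t = r_t + γ·V(s_{t+1}) − V(s_t).
        Accumulates the critic's regression gradient in the same pass (sketch)."""
        values = critic(states)                       # V(s_0) .. V(s_T), shape (T+1,)
        targets = rewards + gamma * values[1:].detach()
        advantages = targets - values[:-1]
        critic_loss = advantages.pow(2).mean()        # MSE toward the TD target
        critic_loss.backward()                        # "critic gradient accumulation"
        return advantages.detach()
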
[2025-08-20 14:52:13,248][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:52:13,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:52:13,255][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:52:15,722][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:52:15,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:52:15,729][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:52:15,732][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:52:15,732][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:52:16,031 – 14:52:41,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, ~0.8 s apart; collapsed here for readability)
[2025-08-20 14:52:42,948][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:52:43,955][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:52:43,957][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:52:45,286][__main__][INFO] - Iteration 430 took 54s (37.22% Gen, 62.78% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 33m 45s. Estimated total time: 15h 15m 59s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 59s.
[2025-08-20 14:52:45,287][__main__][INFO] - Starting iteration 430.
[2025-08-20 14:53:08,064][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:53:08,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:53:08,072][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:53:10,519][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:53:10,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:53:10,527][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:53:10,529][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:53:10,530][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:53:10,828 – 14:53:35,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, ~0.8 s apart; collapsed here for readability)
[2025-08-20 14:53:37,574][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:53:38,508][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:53:38,510][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:53:39,844][__main__][INFO] - Iteration 431 took 54s (37.29% Gen, 62.71% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 26m 8s. Estimated total time: 15h 9m 16s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 55s, 500 more iterations: 7h 34m 38s.
[2025-08-20 14:53:39,846][__main__][INFO] - Starting iteration 431.
[2025-08-20 14:54:02,757][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:54:02,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:54:02,764][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:54:05,234][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:54:05,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:54:05,241][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 14:54:05,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 14:54:05,244][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 14:54:05,544 – 14:54:30,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, ~0.8 s apart; collapsed here for readability)
[2025-08-20 14:54:32,180][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 14:54:33,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 14:54:33,185][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 14:54:34,579][__main__][INFO] - Iteration 432 took 54s (37.36% Gen, 62.64% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 28m 8s. Estimated total time: 15h 12m 12s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 13s, 500 more iterations: 7h 36m 6s.
[2025-08-20 14:54:34,580][__main__][INFO] - Starting iteration 432.
[2025-08-20 14:54:57,807][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:54:57,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:54:57,814][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:00,319][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:00,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:00,327][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:00,330][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:55:00,330][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:55:00,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:01,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:02,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:03,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:03,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:04,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:05,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:06,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:06,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:07,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:08,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:09,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:10,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:10,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:11,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:12,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:55:13,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:14,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:14,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:15,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:16,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:17,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:18,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:18,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:19,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:21,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:21,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:22,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:23,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:24,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:24,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:25,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:27,359][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:55:28,299][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:55:28,301][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:55:29,931][__main__][INFO] - Iteration 433 took 55s (37.49% Gen, 62.51% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 37m 31s. Estimated total time: 15h 22m 30s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 15s. [2025-08-20 14:55:29,933][__main__][INFO] - Starting iteration 433. 
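Between "Receiving advantage packets" and "Apply reinforce step", every iteration logs exactly 32 accumulation messages of 8,960 tokens each, i.e. one optimizer update backed by roughly 287k tokens of gradient accumulation; the ~0.8 s spacing between messages is the per-micro-batch forward/backward pass, and the accumulation loop accounts for essentially all of the step's ΔTime: 00:00:26. Below is a sketch of this accumulate-then-step pattern in standard PyTorch; the policy.token_logprobs call and the per-token loss normalization are hypothetical stand-ins, not the mllm API.

import torch

TOKENS_PER_MICROBATCH = 8960
NUM_MICROBATCHES = 32

def apply_reinforce_step(policy, optimizer, microbatches, logger):
    optimizer.zero_grad(set_to_none=True)
    for mb in microbatches:  # 32 micro-batches per iteration in this run
        # REINFORCE loss: advantage-weighted negative log-probabilities,
        # normalized per token and scaled so the accumulated gradient
        # equals one full-batch average.
        logprobs = policy.token_logprobs(mb["input_ids"], mb["action_mask"])  # hypothetical API
        loss = -(mb["advantages"] * logprobs).sum() / TOKENS_PER_MICROBATCH
        (loss / NUM_MICROBATCHES).backward()
        logger.info(f"Accumulated the policy gradient loss for {TOKENS_PER_MICROBATCH} tokens.")
    optimizer.step()  # the single "Apply reinforce step" per iteration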
[2025-08-20 14:55:52,952][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:52,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:52,960][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:55,398][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:55,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:55,406][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:55:55,408][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:55:55,409][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:55:55,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:56,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:57,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:58,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:58,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:55:59,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:00,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:01,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:02,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:02,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:03,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:04,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:05,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:06,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:06,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:07,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:56:08,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:09,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:10,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:10,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:11,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:12,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:13,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:14,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:15,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:15,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:16,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:17,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:18,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:19,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:19,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:20,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:22,319][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:56:23,268][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:56:23,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:56:24,594][__main__][INFO] - Iteration 434 took 54s (37.65% Gen, 62.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 25m 7s. Estimated total time: 15h 11m 1s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 6s, 500 more iterations: 7h 35m 30s. [2025-08-20 14:56:24,596][__main__][INFO] - Starting iteration 434. 
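Each iteration ends by persisting both optimizer states under .../agent_trainer/, which keeps the run resumable with Adam moments intact rather than restarting the optimizer cold. A minimal sketch of that checkpointing, assuming plain torch.save/torch.load of state_dicts; the helper names are illustrative.

import os

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
    policy_path = os.path.join(out_dir, "policy_optimizer_state.pt")
    critic_path = os.path.join(out_dir, "critic_optimizer_state.pt")
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info(f"Saved main optimizer state to {policy_path}")
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info(f"Saved critic optimizer state to {critic_path}")

def load_optimizer_states(policy_opt, critic_opt, out_dir):
    # On resume, restore the moments before taking the first new step.
    policy_opt.load_state_dict(torch.load(os.path.join(out_dir, "policy_optimizer_state.pt")))
    critic_opt.load_state_dict(torch.load(os.path.join(out_dir, "critic_optimizer_state.pt")))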
[2025-08-20 14:56:47,460][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:56:47,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:56:47,468][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:56:49,947][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:56:49,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:56:49,955][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:56:49,958][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:56:49,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:56:50,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:51,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:51,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:52,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:53,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:54,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:55,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:55,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:56,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:57,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:58,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:58,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:56:59,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:00,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:01,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:02,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:57:02,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:03,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:04,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:05,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:06,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:07,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:08,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:08,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:09,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:10,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:11,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:12,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:12,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:13,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:14,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:15,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:16,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:57:17,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:57:17,866][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:57:19,262][__main__][INFO] - Iteration 435 took 54s (37.32% Gen, 62.68% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 24m 17s. Estimated total time: 15h 11m 6s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 6s, 500 more iterations: 7h 35m 33s. [2025-08-20 14:57:19,264][__main__][INFO] - Starting iteration 435. 
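The per-iteration summary lines are plain arithmetic on the average iteration duration: at roughly 54.7 s per iteration, 10 more iterations is 9m 7s and 500 more is about 7h 36m, matching the figures above, and the ~8.5 h remaining against ~15.2 h total implies a schedule of about 1,000 iterations. A sketch of that estimator, assuming a mean over all completed iterations (the trainer's actual averaging window is not visible in the log).

def fmt_hms(seconds):
    s = int(round(seconds))
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    return " ".join(p for p in ((f"{h}h" if h else ""), (f"{m}m" if m or h else ""), f"{s}s") if p)

def eta_summary(durations, total_iterations):
    # durations: wall-clock seconds of every completed iteration so far.
    avg = sum(durations) / len(durations)
    remaining = total_iterations - len(durations)
    print(f"Estimated remaining time: {fmt_hms(remaining * avg)}. "
          f"Estimated total time: {fmt_hms(total_iterations * avg)}.")
    print("Time estimates for " + ", ".join(
        f"{n} more iterations: {fmt_hms(n * avg)}" for n in (10, 100, 500)) + ".")

For example, eta_summary([54.7] * 435, 1000) prints a remaining-time estimate of about 8h 35m, in line with the log.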
[2025-08-20 14:57:42,248][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:57:42,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:57:42,256][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:57:44,713][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:57:44,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:57:44,721][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:57:44,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:57:44,723][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:57:45,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:45,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:46,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:47,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:48,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:48,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:49,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:50,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:51,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:52,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:52,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:53,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:54,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:55,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:56,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:56,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:57:57,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:59,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:57:59,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:00,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:01,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:02,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:02,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:03,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:04,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:05,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:06,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:06,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:07,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:08,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:09,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:10,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:11,758][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:58:12,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:58:12,726][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:58:14,071][__main__][INFO] - Iteration 436 took 54s (37.48% Gen, 62.52% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 25m 44s. Estimated total time: 15h 13m 27s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 43s. [2025-08-20 14:58:14,073][__main__][INFO] - Starting iteration 436. 
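The "Sharing advantage alignment data" / "Receiving advantage packets" pair is where the two agents' trainers exchange per-step advantage estimates: advantage alignment shapes each agent's policy gradient by weighting it with discounted combinations of its own and its co-player's advantages, so each trainer needs the other's estimates before accumulating its loss. The toy sketch below shows one plausible form of that weighting; the packet contents, the discount factor, and the exact alignment term are assumptions drawn from the advantage-alignment literature, not read from the mllm code.

import torch

def alignment_weights(own_adv, other_adv, gamma=0.96):
    # Toy advantage-alignment term: weight the co-player's advantage at
    # step t by a gamma-discounted running sum of our own advantages up
    # to t, so actions that helped us while helping them get reinforced.
    weights = torch.empty_like(other_adv)
    running = 0.0
    for t in range(other_adv.numel()):
        running = gamma * running + own_adv[t].item()
        weights[t] = running * other_adv[t].item()
    return weights

# A received "advantage packet" could be as small as
# {"episode_id": ..., "advantages": other_adv.cpu()}; each trainer merges
# the received advantages with its own before the accumulation loop above.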
[2025-08-20 14:58:37,096][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:58:37,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:58:37,104][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:58:39,558][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:58:39,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:58:39,566][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:58:39,568][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:58:39,569][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:58:39,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:40,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:41,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:42,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:43,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:43,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:44,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:45,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:46,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:47,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:47,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:48,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:49,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:50,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:50,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:51,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:58:52,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:53,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:54,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:54,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:56,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:56,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:57,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:58,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:58:59,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:00,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:00,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:01,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:02,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:03,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:04,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:04,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:06,474][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 14:59:07,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 14:59:07,458][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 14:59:08,991][__main__][INFO] - Iteration 437 took 54s (37.45% Gen, 62.55% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 26m 39s. Estimated total time: 15h 15m 17s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 38s. [2025-08-20 14:59:08,992][__main__][INFO] - Starting iteration 437. 
[2025-08-20 14:59:32,161][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:59:32,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:59:32,168][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:59:34,640][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:59:34,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:59:34,648][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 14:59:34,650][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 14:59:34,651][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 14:59:34,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:35,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:36,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:37,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:38,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:38,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:39,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:40,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:41,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:42,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:42,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:43,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:44,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:45,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:46,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:46,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 14:59:47,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:48,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:49,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:50,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:50,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:51,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:52,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:53,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:54,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:54,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:56,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:56,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:57,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:58,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 14:59:59,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:00,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:01,618][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:00:02,622][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:00:02,624][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:00:04,564][__main__][INFO] - Iteration 438 took 55s (37.26% Gen, 62.74% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 36m 37s. Estimated total time: 15h 26m 11s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 37s, 500 more iterations: 7h 43m 5s. [2025-08-20 15:00:04,565][__main__][INFO] - Starting iteration 438. 
[2025-08-20 15:00:27,633][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:00:27,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:00:27,641][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:00:30,061][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:00:30,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:00:30,069][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:00:30,071][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:00:30,072][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:00:30,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:31,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:31,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:32,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:33,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:34,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:35,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:35,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:36,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:37,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:38,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:39,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:39,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:40,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:41,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:42,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:00:43,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:43,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:44,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:45,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:46,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:47,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:48,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:49,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:49,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:50,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:51,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:52,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:53,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:53,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:54,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:55,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:00:57,052][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:00:58,016][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:00:58,017][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:00:59,364][__main__][INFO] - Iteration 439 took 54s (37.65% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 22m 49s. Estimated total time: 15h 13m 18s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 19s, 500 more iterations: 7h 36m 39s. [2025-08-20 15:00:59,365][__main__][INFO] - Starting iteration 439. 
[2025-08-20 15:01:22,333][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:01:22,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:01:22,341][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:01:24,788][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:01:24,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:01:24,796][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:01:24,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:01:24,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:01:25,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:25,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:26,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:27,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:28,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:29,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:29,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:30,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:31,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:32,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:33,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:33,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:34,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:35,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:36,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:37,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:01:38,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:39,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:39,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:40,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:41,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:42,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:42,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:43,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:44,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:45,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:46,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:46,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:47,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:48,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:49,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:50,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:01:51,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:01:52,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:01:52,681][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:01:54,085][__main__][INFO] - Iteration 440 took 54s (37.50% Gen, 62.50% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 20m 36s. Estimated total time: 15h 11m 59s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 11s, 500 more iterations: 7h 35m 59s. [2025-08-20 15:01:54,087][__main__][INFO] - Starting iteration 440. 
[2025-08-20 15:02:17,018][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:02:17,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:02:17,025][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:02:19,481][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:02:19,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:02:19,489][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:02:19,491][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:02:19,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:02:19,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:20,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:21,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:22,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:22,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:23,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:24,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:25,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:26,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:26,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:27,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:28,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:29,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:30,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:30,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:31,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:02:32,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:33,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:34,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:34,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:35,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:36,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:37,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:38,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:39,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:40,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:40,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:41,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:42,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:43,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:44,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:44,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:02:46,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:02:47,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:02:47,457][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:02:49,032][__main__][INFO] - Iteration 441 took 54s (37.25% Gen, 62.74% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 23m 26s. Estimated total time: 15h 15m 44s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 34s, 500 more iterations: 7h 37m 52s. [2025-08-20 15:02:49,033][__main__][INFO] - Starting iteration 441. 
[2025-08-20 15:03:12,098][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:03:12,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:03:12,106][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:03:14,542][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:03:14,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:03:14,550][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:03:14,552][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:03:14,553][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:03:14,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:15,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:16,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:17,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:18,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:18,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:19,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:20,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:21,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:22,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:22,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:23,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:24,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:25,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:25,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:26,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:03:27,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:28,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:29,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:29,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:30,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:31,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:32,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:33,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:34,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:35,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:35,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:36,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:37,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:38,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:39,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:39,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:03:41,522][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:03:42,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:03:42,491][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:03:43,828][__main__][INFO] - Iteration 442 took 54s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 20m 1s. Estimated total time: 15h 13m 14s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 19s, 500 more iterations: 7h 36m 37s. [2025-08-20 15:03:43,830][__main__][INFO] - Starting iteration 442. 
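Note on "Get advantages with critic gradient accumulation": the task name suggests a single pass that both produces advantage estimates from the critic adapter and accumulates the critic's own regression gradient. A sketch under that reading; the simple A = R − V(s) baseline is an assumption (the trainer may well use GAE or another estimator):

    import torch
    import torch.nn.functional as F

    def advantages_with_critic_accum(critic, obs, returns):
        values = critic(obs)                     # V(s) per token from the critic adapter
        critic_loss = F.mse_loss(values, returns)
        critic_loss.backward()                   # accumulates into the critic's grads
        return (returns - values).detach()       # advantages carry no grad to the policy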
[2025-08-20 15:04:07,183][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:04:07,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:04:07,190][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:04:09,610][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:04:09,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:04:09,618][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:04:09,621][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:04:09,621][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:04:09,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:10,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:11,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:12,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:13,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:13,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:14,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:15,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:16,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:17,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:17,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:18,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:19,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:20,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:21,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:21,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:04:22,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:23,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:24,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:25,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:25,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:26,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:27,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:28,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:29,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:30,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:31,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:31,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:32,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:33,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:34,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:35,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:04:36,633][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:04:37,599][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:04:37,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:04:39,009][__main__][INFO] - Iteration 443 took 55s (37.92% Gen, 62.08% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 25m 30s. Estimated total time: 15h 19m 38s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 57s, 500 more iterations: 7h 39m 49s. [2025-08-20 15:04:39,015][__main__][INFO] - Starting iteration 443. 
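Note on "Sharing advantage alignment data." / "Receiving advantage packets.": advantage-alignment-style objectives couple one player's advantages with its co-player's, so each trainer has to ship its estimates to the other side before the policy loss can be formed. A purely illustrative sketch of such an exchange; `AdvantagePacket`, the queue transport, and the field names are all hypothetical, not mllm's wire format:

    import dataclasses
    import torch

    @dataclasses.dataclass
    class AdvantagePacket:
        iteration: int
        advantages: torch.Tensor   # per-token advantage estimates
        token_mask: torch.Tensor   # which tokens belong to the sender's turns

    def exchange(packet, send_queue, recv_queue):
        send_queue.put(packet)     # share our advantages with the co-player's trainer
        return recv_queue.get()    # block until their packet arrives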
[2025-08-20 15:05:02,051][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:02,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:02,059][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:04,495][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:04,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:04,503][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:04,505][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:05:04,506][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:05:04,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:05,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:06,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:07,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:07,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:08,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:09,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:10,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:11,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:11,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:12,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:13,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:14,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:15,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:15,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:16,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:05:17,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:18,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:19,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:19,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:20,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:21,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:22,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:23,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:24,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:25,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:25,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:26,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:27,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:28,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:29,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:29,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:05:31,540][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:05:32,482][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:05:32,483][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:05:34,443][__main__][INFO] - Iteration 444 took 55s (37.18% Gen, 62.82% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 28m 39s. Estimated total time: 15h 23m 42s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 51s. [2025-08-20 15:05:34,445][__main__][INFO] - Starting iteration 444. 
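Note on the repeated "Accumulated the policy gradient loss for 8960 tokens." lines: each iteration logs 32 of them before one "Apply reinforce step", i.e. roughly 32 × 8,960 = 286,720 tokens contribute to a single optimizer update. That pattern is classic gradient accumulation over fixed-size token microbatches; a sketch, with the per-microbatch loss left abstract and assuming `loss_fn` returns the mean per-token loss:

    def accumulate_and_step(loss_fn, optimizer, microbatches, log):
        optimizer.zero_grad(set_to_none=True)
        total_tokens = sum(mb["n_tokens"] for mb in microbatches)
        for mb in microbatches:
            # Reweight so the update matches one big batch of total_tokens.
            loss = loss_fn(mb) * (mb["n_tokens"] / total_tokens)
            loss.backward()      # grads accumulate across microbatches
            log.info("Accumulated the policy gradient loss for %d tokens.", mb["n_tokens"])
        optimizer.step()         # the single "Apply reinforce step"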
[2025-08-20 15:05:57,414][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:57,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:57,422][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:59,892][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:59,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:59,900][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:05:59,903][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:05:59,903][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:06:00,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:00,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:01,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:02,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:03,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:04,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:04,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:05,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:06,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:07,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:08,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:08,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:09,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:10,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:11,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:12,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:06:12,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:13,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:14,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:15,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:16,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:16,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:17,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:18,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:19,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:20,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:21,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:22,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:22,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:23,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:24,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:25,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:26,956][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:06:27,915][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:06:27,916][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:06:29,370][__main__][INFO] - Iteration 445 took 54s (37.31% Gen, 62.69% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 19m 26s. Estimated total time: 15h 15m 25s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 42s. [2025-08-20 15:06:29,372][__main__][INFO] - Starting iteration 445. 
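Note on the loss being accumulated: "Apply reinforce step" points at a REINFORCE-style surrogate, i.e. per-token −A·log π masked to the agent's own tokens. A generic sketch of that estimator; any baseline, discounting, or clipping the trainer actually applies is not visible in the log:

    import torch

    def reinforce_loss(logprobs, advantages, mask):
        # -A_t * log pi(a_t | s_t), averaged over the agent's own tokens.
        per_token = -(advantages.detach() * logprobs) * mask
        return per_token.sum() / mask.sum().clamp(min=1)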
[2025-08-20 15:06:52,958][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:06:52,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:06:52,966][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:06:55,446][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:06:55,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:06:55,453][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:06:55,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:06:55,456][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:06:55,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:56,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:57,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:58,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:58,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:06:59,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:00,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:01,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:02,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:02,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:03,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:04,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:05,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:06,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:06,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:07,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:07:08,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:09,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:10,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:10,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:11,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:12,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:13,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:14,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:15,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:16,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:16,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:17,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:18,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:19,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:20,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:20,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:22,426][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:07:23,363][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:07:23,364][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:07:24,718][__main__][INFO] - Iteration 446 took 55s (38.14% Gen, 61.86% Train). Generation: 21s, Training: 34s. Estimated remaining time: 8h 25m 31s. Estimated total time: 15h 22m 25s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 12s. [2025-08-20 15:07:24,719][__main__][INFO] - Starting iteration 446. 
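Note on the "Saved main optimizer state" / "Saved critic optimizer state" lines: persisting both optimizers every iteration keeps the run resumable after preemption. The equivalent PyTorch calls are one-liners (file names copied from the log; everything else is a sketch):

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir):
        # state_dict() captures step counts and moment estimates (e.g. Adam's m, v).
        torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
        torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")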
[2025-08-20 15:07:47,631][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:07:47,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:07:47,639][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:07:50,104][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:07:50,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:07:50,112][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:07:50,114][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:07:50,115][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:07:50,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:51,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:51,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:52,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:53,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:54,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:55,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:55,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:56,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:57,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:58,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:59,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:07:59,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:00,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:01,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:02,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:08:03,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:04,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:05,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:05,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:06,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:07,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:08,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:09,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:09,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:10,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:11,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:12,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:13,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:13,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:14,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:15,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:17,072][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:08:18,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:08:18,029][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:08:19,644][__main__][INFO] - Iteration 447 took 54s (37.23% Gen, 62.77% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 17m 36s. Estimated total time: 15h 15m 24s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 42s. [2025-08-20 15:08:19,646][__main__][INFO] - Starting iteration 447. 
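Note on the timing summaries: at ~55 s per iteration the fixed-horizon estimates follow directly, e.g. "10 more iterations: 9m 9s" is 54.9 s × 10 = 549 s, and ~8 h remaining against a ~15 h total implies a budget of roughly 1,000 iterations. The arithmetic as a sketch; the iteration total below is inferred from those estimates, not read from a config:

    def fmt(seconds: float) -> str:
        s = int(round(seconds))
        h, rem = divmod(s, 3600)
        m, sec = divmod(rem, 60)
        return f"{h}h {m}m {sec}s" if h else f"{m}m {sec}s"

    avg_iter_s, done, total = 54.9, 447, 1000          # total is an assumption
    print("Estimated remaining time:", fmt(avg_iter_s * (total - done)))
    for n in (10, 100, 500):
        print(f"{n} more iterations:", fmt(avg_iter_s * n))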
[2025-08-20 15:08:42,900][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:08:42,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:08:42,908][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:08:45,345][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:08:45,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:08:45,353][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:08:45,355][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:08:45,356][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:08:45,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:46,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:47,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:48,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:48,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:49,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:50,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:51,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:52,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:52,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:53,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:54,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:55,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:55,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:56,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:57,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:08:58,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:08:59,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:00,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:01,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:01,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:02,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:03,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:04,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:05,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:05,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:06,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:07,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:08,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:09,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:09,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:10,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:12,322][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:09:13,333][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:09:13,336][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:09:14,716][__main__][INFO] - Iteration 448 took 55s (37.78% Gen, 62.22% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 19m 5s. Estimated total time: 15h 17m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 54s. [2025-08-20 15:09:14,717][__main__][INFO] - Starting iteration 448. 
[2025-08-20 15:09:37,741][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:09:37,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:09:37,748][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:09:40,207][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:09:40,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:09:40,215][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:09:40,218][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:09:40,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:09:40,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:41,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:42,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:42,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:43,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:44,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:45,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:46,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:46,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:47,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:48,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:49,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:50,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:50,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:51,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:52,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:09:53,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:54,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:54,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:55,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:56,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:57,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:58,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:09:59,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:00,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:00,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:01,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:02,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:03,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:03,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:04,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:05,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:07,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:10:08,039][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:10:08,041][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:10:11,584][__main__][INFO] - Iteration 449 took 56s (36.19% Gen, 63.81% Train). Generation: 20s, Training: 36s. Estimated remaining time: 8h 48m 5s. Estimated total time: 15h 47m 46s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 46s, 500 more iterations: 7h 53m 53s. [2025-08-20 15:10:11,585][__main__][INFO] - Starting iteration 449. 
[2025-08-20 15:10:34,578][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:10:34,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:10:34,585][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:10:37,067][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:10:37,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:10:37,075][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:10:37,077][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:10:37,078][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:10:37,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:38,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:38,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:39,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:40,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:41,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:42,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:42,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:43,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:44,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:45,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:46,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:46,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:47,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:48,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:49,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:10:50,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:50,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:51,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:52,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:53,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:54,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:54,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:56,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:56,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:57,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:58,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:10:59,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:00,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:00,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:01,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:02,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:04,043][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:11:04,972][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:11:04,973][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:11:06,328][__main__][INFO] - Iteration 450 took 54s (37.50% Gen, 62.50% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 11m 46s. Estimated total time: 15h 12m 22s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 14s, 500 more iterations: 7h 36m 11s. [2025-08-20 15:11:06,329][__main__][INFO] - Starting iteration 450. 
[2025-08-20 15:11:29,277][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:11:29,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:11:29,284][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:11:31,743][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:11:31,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:11:31,752][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:11:31,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:11:31,754][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:11:32,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:32,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:33,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:34,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:35,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:36,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:36,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:37,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:38,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:39,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:39,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:40,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:41,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:42,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:43,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:43,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:11:44,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:45,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:46,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:47,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:47,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:48,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:49,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:50,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:51,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:52,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:53,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:53,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:54,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:55,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:56,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:57,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:11:58,670][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:11:59,607][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:11:59,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:12:03,874][__main__][INFO] - Iteration 451 took 57s (35.56% Gen, 59.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 57m 31s. Estimated total time: 15h 59m 3s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 54s, 500 more iterations: 7h 59m 31s. [2025-08-20 15:12:03,875][__main__][INFO] - Starting iteration 451. 
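The per-iteration summaries extrapolate a running average of wall time per iteration. The arithmetic can be checked against iteration 451: at roughly 57.54 s per iteration, 10, 100, and 500 more iterations come out to 9m 35s, 1h 35m 54s, and about 7h 59m 30s, in line with the logged figures. A sketch of that arithmetic (the helper names are hypothetical):

    def fmt_hms(seconds):
        # Render seconds as "Xh Ym Zs", dropping the hours part when zero.
        s = int(round(seconds))
        h, rem = divmod(s, 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

    avg_iter_s = 57.54  # average implied by iteration 451's summary
    for n in (10, 100, 500):
        print(f"{n} more iterations: {fmt_hms(avg_iter_s * n)}")
    # 10 more iterations: 9m 35s
    # 100 more iterations: 1h 35m 54s
    # 500 more iterations: 7h 59m 30s

The "Estimated remaining time" would then be the same average multiplied by however many iterations are left in the run.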
[2025-08-20 15:12:27,356][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:12:27,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:12:27,364][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:12:29,813][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:12:29,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:12:29,821][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:12:29,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:12:29,824][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:12:30,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:30,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:31,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:32,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:33,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:34,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:34,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:35,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:36,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:37,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:38,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:38,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:39,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:40,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:41,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:42,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:12:42,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:43,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:44,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:45,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:46,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:47,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:48,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:48,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:49,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:50,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:51,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:52,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:52,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:53,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:54,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:55,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:12:56,779][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:12:57,697][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:12:57,698][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:12:59,386][__main__][INFO] - Iteration 452 took 55s (37.88% Gen, 62.12% Train). Generation: 21s, Training: 34s. Estimated remaining time: 8h 22m 41s. Estimated total time: 15h 25m 9s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 30s, 500 more iterations: 7h 42m 34s. [2025-08-20 15:12:59,387][__main__][INFO] - Starting iteration 452. 
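Each iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens" entries before a single "Apply reinforce step", i.e. 32 micro-batches of 8,960 tokens, or 286,720 tokens per optimizer update. A minimal sketch of that accumulate-then-step pattern, assuming standard PyTorch gradient accumulation; the function and argument names are placeholders, not the trainer's real interface:

    import logging

    logger = logging.getLogger("mllm.training.trainer_common")

    def apply_reinforce_step(model, optimizer, micro_batches, policy_loss_fn):
        # Accumulate gradients over all micro-batches, then update once.
        optimizer.zero_grad(set_to_none=True)
        for batch in micro_batches:  # 32 micro-batches per update in this run
            loss, n_tokens = policy_loss_fn(model, batch)  # n_tokens == 8960 here
            (loss / len(micro_batches)).backward()  # scale so gradients average
            logger.info("Accumulated the policy gradient loss for %d tokens.", n_tokens)
        optimizer.step()  # the single update logged as "Apply reinforce step"

The constant 8,960-token count suggests fixed-size (padded) micro-batches, and at roughly 0.8 s per accumulation the 32 passes account for most of the ~26 s ΔTime reported for "Apply reinforce step".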
[2025-08-20 15:13:23,621][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:13:23,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:13:23,628][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:13:26,072][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:13:26,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:13:26,079][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:13:26,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:13:26,082][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:13:26,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:27,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:27,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:28,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:29,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:30,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:31,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:31,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:32,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:33,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:34,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:35,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:35,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:36,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:37,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:38,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:13:39,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:39,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:40,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:41,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:42,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:43,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:44,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:45,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:45,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:46,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:47,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:48,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:49,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:49,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:50,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:51,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:13:53,109][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:13:54,299][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:13:54,301][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:13:55,773][__main__][INFO] - Iteration 453 took 56s (38.67% Gen, 61.33% Train). Generation: 21s, Training: 34s. Estimated remaining time: 8h 36m 20s. Estimated total time: 15h 39m 45s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 58s, 500 more iterations: 7h 49m 52s. [2025-08-20 15:13:55,774][__main__][INFO] - Starting iteration 453. 
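After every update both optimizer states are rewritten to the same two files under agent_trainer/, so the run keeps exactly one restartable optimizer checkpoint per role. A sketch of that bookkeeping, assuming a plain torch.save of the state dicts (the actual trainer_common logic is not shown in the log):

    import logging
    import os

    import torch

    logger = logging.getLogger("mllm.training.trainer_common")

    def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir):
        # Overwrite-in-place: each save replaces the previous iteration's file.
        os.makedirs(out_dir, exist_ok=True)
        for tag, opt, fname in (
            ("main", policy_optimizer, "policy_optimizer_state.pt"),
            ("critic", critic_optimizer, "critic_optimizer_state.pt"),
        ):
            path = os.path.join(out_dir, fname)
            torch.save(opt.state_dict(), path)
            logger.info("Saved %s optimizer state to %s", tag, path)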
[2025-08-20 15:14:20,395][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:14:20,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:14:20,402][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:14:22,856][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:14:22,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:14:22,863][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:14:22,866][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:14:22,866][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:14:23,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:23,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:24,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:25,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:26,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:27,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:27,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:28,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:29,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:30,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:31,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:31,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:32,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:33,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:34,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:35,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:14:35,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:36,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:37,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:38,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:39,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:40,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:41,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:41,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:42,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:43,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:44,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:45,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:45,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:46,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:47,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:48,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:14:49,768][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:14:50,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:14:50,681][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:14:52,035][__main__][INFO] - Iteration 454 took 56s (39.39% Gen, 60.61% Train). Generation: 22s, Training: 34s. Estimated remaining time: 8h 33m 19s. Estimated total time: 15h 37m 40s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 50s. [2025-08-20 15:14:52,036][__main__][INFO] - Starting iteration 454. 
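In the iteration summaries, the Gen/Train percentages appear to be shares of the whole iteration wall time, which would explain why a pair such as 35.56% + 59.35% (iteration 451) falls short of 100%: the remainder is checkpointing and bookkeeping, while the "Generation: 20s, Training: 34s" figures are the same two durations rounded to whole seconds. A sketch under that assumed interpretation; the 20.27 s and 33.83 s inputs are reverse-engineered to reproduce iteration 451's line, not logged values:

    def iteration_summary(i, gen_s, train_s, total_s):
        # Percentages over the full iteration; seconds rounded for display.
        return (
            f"Iteration {i} took {total_s:.0f}s "
            f"({100 * gen_s / total_s:.2f}% Gen, {100 * train_s / total_s:.2f}% Train). "
            f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s."
        )

    print(iteration_summary(451, 20.27, 33.83, 57))
    # Iteration 451 took 57s (35.56% Gen, 59.35% Train). Generation: 20s, Training: 34s.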
[2025-08-20 15:15:15,100][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:15:15,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:15:15,108][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:15:17,561][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:15:17,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:15:17,569][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:15:17,571][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:15:17,572][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:15:17,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:18,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:19,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:20,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:21,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:21,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:22,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:23,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:24,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:25,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:25,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:26,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:27,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:28,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:28,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:29,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:15:30,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:31,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:32,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:32,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:33,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:34,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:35,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:36,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:37,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:38,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:38,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:39,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:40,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:41,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:42,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:42,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:15:44,439][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:15:45,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:15:45,513][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:15:46,829][__main__][INFO] - Iteration 455 took 54s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 7m 56s. Estimated total time: 15h 13m 12s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 19s, 500 more iterations: 7h 36m 36s. [2025-08-20 15:15:46,830][__main__][INFO] - Starting iteration 455. 
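Structurally, every iteration opens with the same advantage-alignment preamble run twice (alternative trajectory batch, critic-side advantages with gradient accumulation, alternative advantage estimates) before advantage data is shared with and received from the peer trainer. The sketch below only mirrors that control flow; every method name is a placeholder inferred from the log messages, not the real trainer_ad_align API, and track_task is the timing helper sketched after iteration 450:

    import logging

    logger = logging.getLogger("mllm.training.trainer_ad_align")

    def run_ad_align_phase(trainer, trajectory_batches):
        # The log shows the three tasks once per batch, two batches per iteration.
        for trajectories in trajectory_batches:
            with track_task("Create alternative trajectory batch"):
                alt = trainer.create_alternative_trajectory_batch(trajectories)
            with track_task("Get advantages with critic gradient accumulation"):
                adv = trainer.get_advantages_with_critic_grad_accum(alt)
            with track_task("Compute alternative advantage estimates"):
                trainer.compute_alternative_advantage_estimates(alt, adv)
        logger.info("Sharing advantage alignment data.")
        trainer.share_advantage_alignment_data()
        logger.info("Receiving advantage packets.")
        return trainer.receive_advantage_packets()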
[2025-08-20 15:16:09,971][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:16:09,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:16:09,978][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:16:12,458][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:16:12,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:16:12,466][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:16:12,468][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:16:12,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:16:12,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:13,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:14,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:15,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:15,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:16,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:17,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:18,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:19,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:19,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:20,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:21,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:22,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:23,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:23,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:24,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:16:25,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:26,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:27,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:27,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:29,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:29,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:30,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:31,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:32,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:33,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:33,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:34,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:35,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:36,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:37,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:37,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:16:39,475][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:16:40,446][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:16:40,447][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:16:41,792][__main__][INFO] - Iteration 456 took 54s (37.57% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 9m 50s. Estimated total time: 15h 16m 0s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 0s. [2025-08-20 15:16:41,793][__main__][INFO] - Starting iteration 456. 
[2025-08-20 15:17:05,248][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:17:05,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:17:05,256][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:17:07,701][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:17:07,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:17:07,708][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:17:07,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:17:07,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:17:08,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:08,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:09,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:10,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:11,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:11,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:12,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:13,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:14,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:15,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:15,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:16,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:17,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:18,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:19,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:19,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:17:20,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:21,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:22,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:23,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:24,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:25,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:25,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:26,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:27,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:28,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:29,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:29,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:30,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:31,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:32,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:33,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:17:34,734][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:17:35,661][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:17:35,662][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:17:36,964][__main__][INFO] - Iteration 457 took 55s (38.09% Gen, 61.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 8h 12m 24s. Estimated total time: 15h 19m 30s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 57s, 500 more iterations: 7h 39m 45s. [2025-08-20 15:17:36,966][__main__][INFO] - Starting iteration 457. 
[2025-08-20 15:18:00,388][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:00,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:00,396][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:02,843][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:02,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:02,850][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:02,853][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:18:02,854][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:18:03,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:03,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:04,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:05,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:06,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:07,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:07,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:08,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:09,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:10,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:11,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:11,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:12,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:13,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:14,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:15,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:18:15,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:16,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:17,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:18,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:19,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:20,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:21,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:21,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:22,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:23,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:24,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:25,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:25,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:26,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:27,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:28,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:29,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:18:30,760][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:18:30,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:18:32,310][__main__][INFO] - Iteration 458 took 55s (37.85% Gen, 62.14% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 14m 23s. Estimated total time: 15h 22m 24s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 12s. [2025-08-20 15:18:32,312][__main__][INFO] - Starting iteration 458. 
[2025-08-20 15:18:55,297][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:55,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:55,305][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:57,769][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:57,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:57,777][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:18:57,780][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:18:57,780][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:18:58,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:58,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:18:59,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:00,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:01,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:02,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:02,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:03,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:04,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:05,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:06,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:06,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:07,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:08,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:09,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:09,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:19:10,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:11,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:12,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:13,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:14,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:15,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:16,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:16,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:17,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:18,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:19,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:20,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:20,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:21,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:22,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:23,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:24,807][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:19:25,762][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:19:25,764][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:19:27,015][__main__][INFO] - Iteration 459 took 54s (37.52% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 2m 46s. Estimated total time: 15h 11m 42s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 51s. [2025-08-20 15:19:27,016][__main__][INFO] - Starting iteration 459. 
[2025-08-20 15:19:50,193][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:19:50,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:19:50,201][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:19:52,666][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:19:52,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:19:52,674][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:19:52,676][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:19:52,677][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:19:52,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:53,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:54,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:55,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:56,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:56,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:57,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:58,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:19:59,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:00,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:00,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:01,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:02,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:03,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:04,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:04,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:20:05,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:06,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:07,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:08,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:08,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:09,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:10,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:11,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:12,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:13,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:14,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:14,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:15,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:16,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:17,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:18,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:19,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:20:20,601][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:20:20,602][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:20:22,031][__main__][INFO] - Iteration 460 took 55s (37.64% Gen, 62.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 7m 3s. Estimated total time: 15h 16m 54s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 27s. [2025-08-20 15:20:22,033][__main__][INFO] - Starting iteration 460. 
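Note: each iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." records before a single "Apply reinforce step", the classic accumulate-then-step pattern. A sketch under assumed names (policy_model, token_log_probs, the microbatch dict keys); the real trainer_common internals are not visible in this log.

    import logging

    logger = logging.getLogger("mllm.training.trainer_common")

    def apply_reinforce_step(policy_model, optimizer, microbatches):
        # microbatches: e.g. 32 dicts, each covering 8960 tokens
        optimizer.zero_grad()
        total_tokens = sum(mb["num_tokens"] for mb in microbatches)
        for mb in microbatches:
            # log-probs of the sampled tokens, weighted by their advantages
            logp = policy_model.token_log_probs(mb["input_ids"], mb["labels"])
            loss = -(mb["advantages"] * logp).sum() / total_tokens
            loss.backward()  # gradients accumulate across microbatches
            logger.info(
                f"Accumulated the policy gradient loss for {mb['num_tokens']} tokens."
            )
        optimizer.step()  # one parameter update per iteration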
[2025-08-20 15:20:45,027][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:20:45,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:20:45,035][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:20:47,490][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:20:47,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:20:47,498][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:20:47,500][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:20:47,501][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:20:47,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:48,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:49,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:50,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:50,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:51,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:52,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:53,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:54,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:54,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:55,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:56,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:57,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:58,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:58,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:20:59,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:21:00,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:01,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:02,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:02,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:03,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:04,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:05,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:06,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:07,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:08,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:08,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:09,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:10,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:11,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:12,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:12,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:14,500][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:21:15,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:21:15,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:21:16,859][__main__][INFO] - Iteration 461 took 54s (37.45% Gen, 62.55% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 2m 59s. Estimated total time: 15h 13m 45s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 22s, 500 more iterations: 7h 36m 52s. [2025-08-20 15:21:16,860][__main__][INFO] - Starting iteration 461. 
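Note: counting those records gives the effective batch size per update. Assuming no accumulation lines were dropped from the log, each reinforce step covers:

    tokens_per_microbatch = 8960
    microbatches_per_step = 32  # accumulation records per iteration above
    print(tokens_per_microbatch * microbatches_per_step)  # 286720 tokens per update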
[2025-08-20 15:21:39,835][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:21:39,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:21:39,843][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:21:42,327][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:21:42,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:21:42,335][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:21:42,337][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:21:42,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:21:42,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:43,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:44,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:45,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:45,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:46,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:47,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:48,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:48,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:49,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:50,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:51,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:52,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:52,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:53,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:55,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:21:55,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:56,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:57,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:58,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:59,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:21:59,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:00,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:01,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:02,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:02,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:03,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:04,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:05,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:06,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:06,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:07,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:09,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:22:10,320][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:22:10,322][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:22:11,715][__main__][INFO] - Iteration 462 took 54s (37.42% Gen, 62.57% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 2m 33s. Estimated total time: 15h 14m 14s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 25s, 500 more iterations: 7h 37m 7s. [2025-08-20 15:22:11,716][__main__][INFO] - Starting iteration 462. 
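Note: the two "Saved … optimizer state" lines that follow every update look like plain state_dict checkpoints. A sketch of that pattern; save_dir stands in for the seed_1000/agent_trainer directory in the paths above, and the helper name is invented.

    import os
    import torch

    def save_optimizer_states(policy_optimizer, critic_optimizer, save_dir):
        os.makedirs(save_dir, exist_ok=True)
        torch.save(policy_optimizer.state_dict(),
                   os.path.join(save_dir, "policy_optimizer_state.pt"))
        torch.save(critic_optimizer.state_dict(),
                   os.path.join(save_dir, "critic_optimizer_state.pt"))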
[2025-08-20 15:22:34,901][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:22:34,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:22:34,909][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:22:37,383][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:22:37,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:22:37,391][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:22:37,394][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:22:37,394][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:22:37,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:38,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:39,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:40,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:40,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:41,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:42,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:43,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:44,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:44,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:45,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:46,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:47,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:48,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:48,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:49,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:22:50,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:51,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:51,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:52,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:53,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:54,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:55,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:56,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:57,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:57,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:58,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:22:59,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:00,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:01,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:01,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:02,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:04,283][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:23:05,226][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:23:05,227][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:23:06,618][__main__][INFO] - Iteration 463 took 54s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 2m 25s. Estimated total time: 15h 15m 1s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 30s. [2025-08-20 15:23:06,620][__main__][INFO] - Starting iteration 463. 
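Note: the end-of-iteration summary is straightforward extrapolation from the mean iteration time. A sketch of the arithmetic; the exact averaging window __main__ uses is not shown in the log, so a flat 55 s mean is assumed here.

    def format_hms(seconds):
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s"

    mean_iter_s = 55.0  # assumed running mean of iteration duration
    for n in (10, 100, 500):
        print(f"{n} more iterations: {format_hms(n * mean_iter_s)}")
    # 500 * 55 s = 27500 s ≈ 7h 38m 20s, consistent with iteration 460's estimate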
[2025-08-20 15:23:29,986][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:23:29,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:23:29,994][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:23:32,462][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:23:32,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:23:32,469][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:23:32,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:23:32,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:23:32,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:33,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:34,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:35,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:35,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:36,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:37,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:38,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:39,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:39,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:40,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:41,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:42,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:43,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:43,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:44,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:23:45,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:46,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:47,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:47,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:49,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:49,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:50,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:51,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:52,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:53,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:53,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:54,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:55,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:56,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:57,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:57,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:23:59,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:24:00,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:24:00,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:24:01,700][__main__][INFO] - Iteration 464 took 55s (37.97% Gen, 62.03% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 4m 28s. Estimated total time: 15h 17m 59s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 59s. [2025-08-20 15:24:01,701][__main__][INFO] - Starting iteration 464. 
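Note: "Sharing advantage alignment data." / "Receiving advantage packets." suggest the two agents' trainers exchange per-turn advantage estimates so each can weight its own log-probabilities by the opponent's advantages. A schematic of that handshake, with a Queue standing in for whatever transport mllm actually uses; the aligned weighting shown is one simplified form, not necessarily the exact objective.

    from queue import Queue
    import torch

    def exchange_advantages(own_adv: torch.Tensor, outbox: Queue, inbox: Queue):
        outbox.put(own_adv)      # "Sharing advantage alignment data."
        return inbox.get()       # "Receiving advantage packets."

    def aligned_weights(own_adv, other_adv, beta=1.0):
        # Simplified advantage-alignment weight: credit actions that were good
        # for this agent and also raised the opponent's advantage.
        return own_adv + beta * own_adv * other_adv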
[2025-08-20 15:24:24,767][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:24:24,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:24:24,775][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:24:27,223][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:24:27,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:24:27,230][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:24:27,233][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:24:27,233][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:24:27,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:28,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:29,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:29,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:30,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:31,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:32,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:33,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:33,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:34,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:35,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:36,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:37,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:37,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:38,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:39,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:24:40,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:41,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:41,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:42,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:43,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:44,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:45,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:46,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:47,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:47,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:48,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:49,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:50,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:51,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:51,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:52,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:24:54,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:24:55,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:24:55,190][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:24:56,736][__main__][INFO] - Iteration 465 took 55s (37.49% Gen, 62.51% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 2m 49s. Estimated total time: 15h 17m 14s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 37s. [2025-08-20 15:24:56,738][__main__][INFO] - Starting iteration 465. 
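Note: "Get advantages with critic gradient accumulation" reads as fitting the value head and producing advantage estimates in the same pass. A plausible sketch assuming a critic that maps states to values and per-step returns; the actual estimator (e.g. GAE vs. plain return-minus-baseline) is not visible in the log.

    import torch.nn.functional as F

    def advantages_with_critic_accumulation(critic, states, returns):
        values = critic(states)                    # V(s_t) for each step
        critic_loss = F.mse_loss(values, returns)
        critic_loss.backward()                     # accumulate critic gradients only
        return (returns - values).detach()         # advantage estimates A_t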
[2025-08-20 15:25:19,776][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:25:19,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:25:19,784][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:25:22,264][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:25:22,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:25:22,272][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:25:22,275][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:25:22,275][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:25:22,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:23,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:24,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:24,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:25,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:26,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:27,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:28,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:28,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:29,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:30,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:31,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:32,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:32,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:33,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:34,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:25:35,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:36,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:37,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:38,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:38,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:39,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:40,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:41,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:42,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:42,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:43,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:44,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:45,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:46,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:46,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:47,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:25:49,165][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:25:50,107][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:25:50,109][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:25:51,550][__main__][INFO] - Iteration 466 took 54s (37.55% Gen, 62.45% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 58m 10s. Estimated total time: 15h 13m 31s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 21s, 500 more iterations: 7h 36m 45s. [2025-08-20 15:25:51,551][__main__][INFO] - Starting iteration 466. 
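Note: "Create alternative trajectory batch" followed by "Compute alternative advantage estimates" hints that each sampled trajectory is paired with counterfactual action sequences whose advantages are scored as well; the pair of tasks running twice per iteration would then be once per agent. Purely illustrative, with invented field names.

    def create_alternative_batch(trajectories, alternative_actions):
        # Same contexts, substituted actions; advantages for these are then
        # estimated with the same critic as for the sampled trajectories.
        return [{**traj, "actions": alt}
                for traj, alt in zip(trajectories, alternative_actions)]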
[2025-08-20 15:26:14,615][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:26:14,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:26:14,623][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:26:17,114][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:26:17,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:26:17,122][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:26:17,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:26:17,125][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:26:17,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:18,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:19,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:19,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:20,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:21,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:22,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:22,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:23,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:24,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:25,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:26,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:26,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:27,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:28,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:29,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:26:30,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:31,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:32,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:32,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:33,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:34,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:35,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:36,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:36,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:37,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:38,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:39,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:40,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:40,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:41,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:42,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:26:44,053][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:26:45,012][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:26:45,014][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:26:46,401][__main__][INFO] - Iteration 467 took 54s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 57m 53s. Estimated total time: 15h 14m 9s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 4s. [2025-08-20 15:26:46,402][__main__][INFO] - Starting iteration 467. 
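Note on the record layout itself: "[timestamp][logger][LEVEL] - message" with comma-separated milliseconds matches Hydra's default job logging and can be reproduced with a standard formatter; whether this run configures it via Hydra is an assumption.

    import logging

    logging.basicConfig(
        format="[%(asctime)s][%(name)s][%(levelname)s] - %(message)s",
        level=logging.INFO,  # default asctime already yields "2025-08-20 15:27:42,079"
    )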
[2025-08-20 15:27:10,272][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:27:10,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:27:10,280][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:27:12,786][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:27:12,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:27:12,794][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 15:27:12,796][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 15:27:12,796][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 15:27:13,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:13,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:14,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:15,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:16,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:17,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:17,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:18,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:19,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:20,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:21,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:21,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:22,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:23,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:24,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:25,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 15:27:25,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:26,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:27,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:28,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:28,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:29,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:30,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:31,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:32,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:32,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:33,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:35,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:35,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:36,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:37,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:38,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 15:27:39,788][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 15:27:40,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 15:27:40,707][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 15:27:42,079][__main__][INFO] - Iteration 468 took 55s (38.40% Gen, 61.60% Train). Generation: 21s, Training: 34s. Estimated remaining time: 8h 10m 45s. Estimated total time: 15h 27m 56s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 58s. [2025-08-20 15:27:42,081][__main__][INFO] - Starting iteration 468. 
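Note: the accumulation records land about 0.79 s apart, which gives a rough training-side token throughput. Measured from consecutive timestamps above, so only approximate.

    tokens = 8960                # tokens per accumulation record
    seconds_per_record = 0.794   # typical gap between consecutive records
    print(f"{tokens / seconds_per_record:,.0f} tokens/s")  # ≈ 11,285 tokens/s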
[2025-08-20 15:28:05,586][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:28:05,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:28:05,593][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:28:08,068][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:28:08,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:28:08,075][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:28:08,078][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:28:08,078][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:28:08,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:09,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:09,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:10,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:11,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:12,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:13,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:13,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:14,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:15,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:16,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:17,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:17,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:18,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:19,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:20,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:21,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:21,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:22,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:23,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:24,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:25,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:25,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:27,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:27,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:28,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:29,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:30,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:31,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:31,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:32,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:33,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:28:35,086][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:28:36,066][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:28:36,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:28:37,417][__main__][INFO] - Iteration 469 took 55s (38.02% Gen, 61.98% Train). Generation: 21s, Training: 34s. Estimated remaining time: 8h 4m 9s. Estimated total time: 15h 22m 16s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 8s.
[2025-08-20 15:28:37,421][__main__][INFO] - Starting iteration 469.
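The recurring "For task: …, ΔVRAM Allocated: …, ΔVRAM Reserved: …, ΔTime: …, Percentage of VRAM taken: …" records read like the output of a small wrapper that snapshots CUDA memory counters around each stage. A plausible reconstruction as a context manager; the torch.cuda calls are the real accounting API, but the name `track` and the exact message format are guesses inferred from the log, not the project's actual code:

```python
import time
from contextlib import contextmanager

import torch

GB = 1024 ** 3

@contextmanager
def track(task: str, device: int = 0):
    """Log ΔVRAM, ΔTime, and total VRAM usage for a named task (sketch)."""
    alloc0 = torch.cuda.memory_allocated(device)
    resv0 = torch.cuda.memory_reserved(device)
    t0 = time.monotonic()
    try:
        yield
    finally:
        d_alloc = (torch.cuda.memory_allocated(device) - alloc0) / GB
        d_resv = (torch.cuda.memory_reserved(device) - resv0) / GB
        dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
        total = torch.cuda.get_device_properties(device).total_memory
        pct = 100 * torch.cuda.memory_reserved(device) / total
        # Message shaped to mirror the log lines above, trailing comma included.
        print(f"For task: {task} , ΔVRAM Allocated: {d_alloc} GB, "
              f"ΔVRAM Reserved: {d_resv} GB, ΔTime: {dt}, "
              f"Percentage of VRAM taken: {pct}%,")

# usage: with track("Create alternative trajectory batch"): build_batch()
```

The sub-second tasks round to ΔTime 00:00:00 and 0.0 GB deltas, which is why only the ~26 s "Apply reinforce step" shows a nonzero change.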
[2025-08-20 15:29:00,510][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:29:00,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:29:00,518][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:29:02,993][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:29:02,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:29:03,001][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:29:03,003][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:29:03,004][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:29:03,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:04,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:04,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:05,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:06,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:07,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:08,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:08,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:09,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:10,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:11,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:12,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:12,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:13,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:14,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:15,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:16,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:16,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:17,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:18,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:19,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:19,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:20,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:22,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:22,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:23,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:24,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:25,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:26,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:26,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:27,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:28,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:29:30,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:29:30,980][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:29:30,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:29:32,572][__main__][INFO] - Iteration 470 took 55s (37.39% Gen, 62.61% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 0m 8s. Estimated total time: 15h 19m 9s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 34s.
[2025-08-20 15:29:32,573][__main__][INFO] - Starting iteration 470.
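The end-of-iteration summary is plain arithmetic: the Gen/Train split comes from the two measured phase durations, and the remaining-time and 10/100/500-iteration projections are an average iteration time multiplied out. A hedged sketch of the computation; the function and argument names are illustrative, not taken from the repo:

```python
def fmt(seconds: float) -> str:
    """Render a duration as 'Xh Ym Zs', matching the log's style."""
    s = int(seconds)
    return f"{s // 3600}h {s % 3600 // 60}m {s % 60}s"

def iteration_summary(i: int, total: int, gen_s: float, train_s: float,
                      avg_iter_s: float) -> str:
    """Per-iteration timing report (sketch of the arithmetic in this log)."""
    took = gen_s + train_s
    gen_pct = 100 * gen_s / took
    parts = [
        f"Iteration {i} took {took:.0f}s "
        f"({gen_pct:.2f}% Gen, {100 - gen_pct:.2f}% Train).",
        f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s.",
        f"Estimated remaining time: {fmt(avg_iter_s * (total - i))}.",
    ]
    parts += [f"{n} more iterations: {fmt(avg_iter_s * n)}" for n in (10, 100, 500)]
    return " ".join(parts)
```

Note the projections wobble between iterations (e.g. 15h 19m vs 16h 34m totals a minute apart), which suggests the estimate leans on recent iteration times rather than a long-run average.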
[2025-08-20 15:30:00,282][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:00,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:00,290][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:02,749][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:02,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:02,757][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:02,759][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:30:02,759][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:30:03,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:03,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:04,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:05,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:06,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:07,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:07,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:08,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:09,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:10,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:10,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:11,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:12,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:13,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:14,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:14,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:15,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:16,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:17,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:18,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:19,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:20,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:20,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:21,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:22,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:23,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:24,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:24,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:25,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:26,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:27,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:28,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:29,767][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:30:30,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:30:30,747][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:30:32,248][__main__][INFO] - Iteration 471 took 59s (42.30% Gen, 57.70% Train). Generation: 25s, Training: 34s. Estimated remaining time: 9h 14m 32s. Estimated total time: 16h 34m 33s. Time estimates for 10 more iterations: 9m 56s, 100 more iterations: 1h 39m 27s, 500 more iterations: 8h 17m 16s.
[2025-08-20 15:30:32,249][__main__][INFO] - Starting iteration 471.
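After every step both optimizer states are rewritten to fixed paths under the run directory, so a preempted job can resume without losing its Adam moments. In PyTorch this is ordinary state_dict round-tripping; a minimal sketch (only the file names come from the log, the helpers themselves are assumed):

```python
import torch

def save_optimizers(policy_opt, critic_opt, out_dir: str) -> None:
    # Overwrite-in-place checkpointing of optimizer state, as in the log.
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

def load_optimizers(policy_opt, critic_opt, out_dir: str) -> None:
    # Restore on resume; map_location keeps this robust across devices.
    policy_opt.load_state_dict(
        torch.load(f"{out_dir}/policy_optimizer_state.pt", map_location="cpu"))
    critic_opt.load_state_dict(
        torch.load(f"{out_dir}/critic_optimizer_state.pt", map_location="cpu"))
```

The two saves land milliseconds apart every iteration, consistent with small adapter-only optimizer states rather than full 7B-model state.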
[2025-08-20 15:30:56,383][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:56,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:56,390][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:58,854][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:58,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:58,861][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:30:58,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:30:58,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:30:59,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:30:59,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:00,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:01,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:02,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:03,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:03,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:04,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:05,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:06,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:07,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:07,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:08,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:09,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:10,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:11,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:11,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:12,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:13,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:14,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:15,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:16,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:17,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:17,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:18,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:19,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:20,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:20,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:21,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:22,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:23,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:24,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:25,773][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:31:26,693][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:31:26,694][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:31:28,055][__main__][INFO] - Iteration 472 took 55s (38.84% Gen, 61.16% Train). Generation: 21s, Training: 34s. Estimated remaining time: 8h 9m 8s. Estimated total time: 15h 30m 5s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 2s.
[2025-08-20 15:31:28,057][__main__][INFO] - Starting iteration 472.
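"Sharing advantage alignment data." followed immediately by "Receiving advantage packets." suggests the two agent trainers swap per-timestep advantage estimates before the opponent-aware loss is formed. The actual transport is not visible in this log; a minimal in-process sketch using a pair of queues, with `AdvantagePacket` as an invented container:

```python
from dataclasses import dataclass
from queue import Queue

import torch

@dataclass
class AdvantagePacket:  # hypothetical container, not the repo's type
    agent: str
    advantages: torch.Tensor  # per-timestep advantage estimates

def exchange(mine: AdvantagePacket, outbox: Queue, inbox: Queue,
             timeout: float = 60.0) -> AdvantagePacket:
    """Share our advantages and block until the opponent's arrive (sketch)."""
    outbox.put(mine)                   # "Sharing advantage alignment data."
    return inbox.get(timeout=timeout)  # "Receiving advantage packets."
```

The share and receive records are logged within a millisecond of each other every iteration, so whatever the real mechanism is, the opponent's packet is evidently already available when the trainer asks for it.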
[2025-08-20 15:31:51,424][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:31:51,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:31:51,432][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:31:53,892][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:31:53,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:31:53,900][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:31:53,902][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:31:53,903][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:31:54,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:54,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:55,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:56,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:57,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:58,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:58,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:31:59,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:00,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:01,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:02,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:02,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:03,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:04,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:05,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:06,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:06,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:07,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:08,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:09,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:10,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:11,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:12,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:13,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:13,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:14,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:15,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:16,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:17,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:17,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:18,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:19,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:21,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:32:21,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:32:21,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:32:25,194][__main__][INFO] - Iteration 473 took 57s (36.62% Gen, 63.38% Train). Generation: 20s, Training: 36s. Estimated remaining time: 8h 30m 22s. Estimated total time: 15h 52m 16s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 13s, 500 more iterations: 7h 56m 8s.
[2025-08-20 15:32:25,195][__main__][INFO] - Starting iteration 473.
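Once the opponent's advantages arrive, an advantage-alignment trainer couples them with the agent's own advantages when weighting the policy gradient. The exact objective inside trainer_ad_align is not recoverable from this log; the sketch below shows one schematic form (own advantage times a discounted causal sum of the opponent's), which should be read as an illustration of the idea rather than the repo's formula:

```python
import torch

def aligned_weights(own_adv: torch.Tensor, opp_adv: torch.Tensor,
                    gamma: float = 0.96) -> torch.Tensor:
    """Schematic advantage-alignment weighting (illustrative only).

    Couples this agent's advantage at step t with a discounted sum of the
    opponent's advantages up to t, so actions are reinforced when they have
    been good for both players."""
    steps = torch.arange(own_adv.shape[0], dtype=own_adv.dtype)
    # decay[t, k] = gamma**(t - k) for k <= t, zero above the diagonal.
    decay = torch.tril(gamma ** (steps.unsqueeze(1) - steps.unsqueeze(0)))
    opp_trace = decay @ opp_adv  # discounted causal sum of opponent advantages
    return own_adv * opp_trace   # per-step weights for the log-prob terms
```

The paired "Create alternative trajectory batch / Compute alternative advantage estimates" tasks appearing twice per iteration are consistent with each side also evaluating counterfactual ("alternative") trajectories before this weighting is applied.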
[2025-08-20 15:32:48,288][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:32:48,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:32:48,295][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:32:50,741][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:32:50,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:32:50,749][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:32:50,751][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:32:50,752][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:32:51,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:51,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:52,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:53,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:54,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:55,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:55,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:56,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:57,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:58,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:58,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:32:59,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:00,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:01,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:02,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:02,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:03,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:04,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:05,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:06,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:07,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:08,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:08,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:09,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:10,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:12,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:13,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:13,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:14,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:15,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:16,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:17,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:18,657][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:33:23,230][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:33:23,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:33:24,662][__main__][INFO] - Iteration 474 took 59s (34.68% Gen, 65.32% Train). Generation: 20s, Training: 38s. Estimated remaining time: 9h 8m 12s. Estimated total time: 16h 31m 6s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 6s, 500 more iterations: 8h 15m 33s.
[2025-08-20 15:33:24,732][__main__][INFO] - Starting iteration 474.
[2025-08-20 15:33:47,721][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:33:47,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:33:47,729][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:33:50,179][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:33:50,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:33:50,187][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:33:50,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:33:50,189][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:33:50,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:51,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:52,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:52,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:53,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:54,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:55,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:56,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:56,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:57,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:58,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:33:59,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:00,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:00,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:01,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:02,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:03,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:03,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:04,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:05,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:06,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:07,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:07,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:08,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:09,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:10,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:11,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:12,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:13,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:13,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:14,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:15,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:17,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:34:18,057][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:34:18,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:34:19,463][__main__][INFO] - Iteration 475 took 54s (37.54% Gen, 62.45% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 48m 21s. Estimated total time: 15h 12m 9s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 12s, 500 more iterations: 7h 36m 4s.
[2025-08-20 15:34:19,464][__main__][INFO] - Starting iteration 475.
[2025-08-20 15:34:42,908][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:34:42,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:34:42,916][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:34:45,351][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:34:45,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:34:45,359][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:34:45,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:34:45,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:34:45,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:46,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:47,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:48,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:48,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:49,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:50,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:51,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:52,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:52,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:53,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:54,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:55,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:55,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:56,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:57,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:58,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:59,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:34:59,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:01,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:02,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:02,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:03,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:04,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:05,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:06,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:06,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:07,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:08,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:09,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:09,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:10,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:12,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:35:13,309][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:35:13,311][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:35:14,795][__main__][INFO] - Iteration 476 took 55s (37.99% Gen, 62.01% Train). Generation: 21s, Training: 34s. Estimated remaining time: 7h 57m 26s. Estimated total time: 15h 22m 10s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 5s.
[2025-08-20 15:35:14,796][__main__][INFO] - Starting iteration 476.
[2025-08-20 15:35:37,872][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:35:37,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:35:37,880][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:35:40,335][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:35:40,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:35:40,342][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:35:40,345][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:35:40,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:35:40,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:41,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:42,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:43,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:43,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:44,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:45,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:46,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:47,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:47,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:48,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:49,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:50,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:50,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:51,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:52,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:53,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:54,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:54,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:55,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:56,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:57,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:58,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:35:59,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:36:00,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:36:00,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:36:01,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:36:02,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:36:03,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:36:04,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:36:04,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:36:05,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:36:07,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:36:08,303][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:36:08,540][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:36:09,891][__main__][INFO] - Iteration 477 took 55s (37.45% Gen, 62.55% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 52m 35s. Estimated total time: 15h 18m 13s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 6s.
[2025-08-20 15:36:09,892][__main__][INFO] - Starting iteration 477.
[2025-08-20 15:36:33,282][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:36:33,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:36:33,290][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:36:35,791][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:36:35,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:36:35,799][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:36:35,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:36:35,802][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:36:36,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32×, through 15:37:01,170)
[2025-08-20 15:37:02,769][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 15:37:03,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:37:03,718][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:37:05,036][__main__][INFO] - Iteration 478 took 55s (37.91% Gen, 62.09% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 52m 30s. Estimated total time: 15h 19m 4s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 32s.
[2025-08-20 15:37:05,038][__main__][INFO] - Starting iteration 478.
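Each iteration accumulates the policy-gradient loss over 32 micro-batches of 8,960 tokens (≈286,720 tokens total) before a single optimizer step, which is why the "Accumulated…" message repeats at ~0.8 s intervals. A minimal sketch of that pattern under standard PyTorch gradient accumulation; policy_loss_fn and the "n_tokens" batch key are hypothetical names, not the repository's actual API:

    import logging
    import torch

    logger = logging.getLogger("mllm.training.trainer_common")

    def accumulate_policy_gradient(model, micro_batches, policy_loss_fn):
        """Accumulate gradients over micro-batches; no optimizer step here."""
        for batch in micro_batches:  # e.g. 32 micro-batches per iteration
            loss = policy_loss_fn(model, batch)
            # Scale so the summed gradients equal the mean over micro-batches.
            (loss / len(micro_batches)).backward()
            logger.info(
                "Accumulated the policy gradient loss for %d tokens.",
                batch["n_tokens"],  # 8960 in this run
            )

Because backward() frees each micro-batch's activations before the next one runs, peak memory stays flat across the loop, consistent with the unchanging VRAM percentage in the log.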
[2025-08-20 15:37:28,148][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:37:28,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:37:28,155][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:37:30,586][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:37:30,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:37:30,593][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:37:30,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:37:30,596][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:37:30,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32×, through 15:37:56,005)
[2025-08-20 15:37:57,627][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 15:37:58,612][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:37:58,614][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:38:00,050][__main__][INFO] - Iteration 479 took 55s (37.59% Gen, 62.40% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 49m 22s. Estimated total time: 15h 16m 51s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 25s.
[2025-08-20 15:38:00,051][__main__][INFO] - Starting iteration 479.
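"Sharing advantage alignment data" / "Receiving advantage packets" indicates the two trainers exchange per-step advantage estimates so each policy's gradient can be weighted by its opponent's advantages, which is the core idea of advantage alignment. The sketch below is one simplified variant of such a surrogate, not the repository's actual objective; the packet layout, the coupling form, and the beta coefficient are all assumptions:

    import torch

    def ad_align_surrogate(logp_tokens, own_adv, opp_adv, beta=1.0):
        """Simplified advantage-alignment REINFORCE surrogate.

        logp_tokens: (T,) log-probs of this agent's own action tokens
        own_adv:     (T,) this agent's advantage estimates
        opp_adv:     (T,) advantages received from the other trainer
        """
        # Standard policy-gradient weighting plus an alignment term that
        # up-weights actions whose advantage co-varies with the opponent's.
        weights = own_adv + beta * own_adv * opp_adv
        return -(weights.detach() * logp_tokens).sum()

The "alternative trajectory batch" tasks logged just before the exchange would then supply the counterfactual advantage estimates each side packs into its outgoing packets.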
[2025-08-20 15:38:23,323][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:38:23,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:38:23,331][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:38:25,778][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:38:25,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:38:25,786][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:38:25,788][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:38:25,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:38:26,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32×, through 15:38:51,217)
[2025-08-20 15:38:52,820][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 15:38:53,754][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:38:53,765][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:38:55,134][__main__][INFO] - Iteration 480 took 55s (37.80% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 49m 38s. Estimated total time: 15h 18m 2s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 1s.
[2025-08-20 15:38:55,135][__main__][INFO] - Starting iteration 480.
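The "Apply reinforce step" and "Saved … optimizer state" lines correspond to stepping both optimizers on the accumulated gradients and then checkpointing their state dicts so a restart can resume Adam moments and step counts. A minimal sketch assuming torch.optim optimizers and the checkpoint paths shown above; the function name and the clipping step are assumptions:

    import torch

    def apply_reinforce_step(policy_opt, critic_opt, ckpt_dir):
        # Step on the gradients accumulated over the 32 micro-batches,
        # then clear them for the next iteration.
        policy_opt.step()
        policy_opt.zero_grad(set_to_none=True)
        critic_opt.step()
        critic_opt.zero_grad(set_to_none=True)
        # Persist optimizer state alongside the adapters.
        torch.save(policy_opt.state_dict(),
                   f"{ckpt_dir}/policy_optimizer_state.pt")
        torch.save(critic_opt.state_dict(),
                   f"{ckpt_dir}/critic_optimizer_state.pt")

The tiny ΔVRAM Allocated for this task (2.3e-05 GB ≈ 24 KB) fits a step that reuses preallocated optimizer buffers after the first iteration, in contrast to the 0.62 GB allocated when those buffers were first created.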
[2025-08-20 15:39:18,681][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:39:18,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:39:18,689][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:39:21,142][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:39:21,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:39:21,149][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:39:21,152][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:39:21,152][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:39:21,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32×, through 15:39:46,564)
[2025-08-20 15:39:48,144][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 15:39:49,365][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:39:49,367][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:39:50,684][__main__][INFO] - Iteration 481 took 55s (37.99% Gen, 62.01% Train). Generation: 21s, Training: 34s. Estimated remaining time: 7h 56m 28s. Estimated total time: 15h 25m 48s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 34s, 500 more iterations: 7h 42m 54s.
[2025-08-20 15:39:50,685][__main__][INFO] - Starting iteration 481.
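The per-iteration summary extrapolates remaining time from the observed iteration duration: at ~55 s/iteration, "500 more iterations: 7h 42m 54s" is consistent with 500 × 55.5 s ≈ 27,700 s, and the Gen/Train split is simply each phase's share of the total (21 s / 55 s ≈ 38%). A minimal sketch of such an estimator; the function name and signature are hypothetical:

    def eta_report(durations, iters_left):
        """durations: wall-clock seconds of the iterations observed so far."""
        mean_s = sum(durations) / len(durations)
        h, rem = divmod(int(mean_s * iters_left), 3600)
        m, s = divmod(rem, 60)
        return f"Estimated remaining time: {h}h {m}m {s}s"

Because the estimate tracks the running mean, the reported total drifts by a few minutes whenever a single iteration runs long, as between iterations 481 and 482 below.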
[2025-08-20 15:40:13,672][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:40:13,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:40:13,680][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:40:16,151][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:40:16,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:40:16,159][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:40:16,160][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:40:16,161][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:40:16,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32×, through 15:40:41,571)
[2025-08-20 15:40:43,195][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 15:40:44,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:40:44,120][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:40:45,420][__main__][INFO] - Iteration 482 took 54s (37.53% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 41m 59s. Estimated total time: 15h 12m 14s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 13s, 500 more iterations: 7h 36m 7s.
[2025-08-20 15:40:45,421][__main__][INFO] - Starting iteration 482.
[2025-08-20 15:41:08,487][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:41:08,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:41:08,495][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:41:10,943][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:41:10,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:41:10,951][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:41:10,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:41:10,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:41:11,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32×, through 15:41:36,297)
[2025-08-20 15:41:37,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 15:41:38,788][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:41:38,790][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:41:40,125][__main__][INFO] - Iteration 483 took 54s (37.71% Gen, 62.29% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 40m 34s. Estimated total time: 15h 11m 43s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 51s.
[2025-08-20 15:41:40,127][__main__][INFO] - Starting iteration 483.
[2025-08-20 15:42:03,489][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:42:03,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:42:03,497][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:42:05,939][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:42:05,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:42:05,947][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:42:05,949][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:42:05,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:42:06,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32×, through 15:42:31,357)
[2025-08-20 15:42:32,953][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 15:42:33,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:42:33,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:42:35,253][__main__][INFO] - Iteration 484 took 55s (37.96% Gen, 62.04% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 46m 42s. Estimated total time: 15h 18m 46s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 23s.
[2025-08-20 15:42:35,255][__main__][INFO] - Starting iteration 484.
[2025-08-20 15:42:58,247][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:42:58,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:42:58,255][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:00,705][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:00,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:00,712][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:00,715][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:43:00,715][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:43:01,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32×, through 15:43:26,055)
[2025-08-20 15:43:27,671][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 15:43:28,619][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:43:28,620][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:43:29,957][__main__][INFO] - Iteration 485 took 54s (37.57% Gen, 62.43% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 38m 43s. Estimated total time: 15h 11m 42s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 51s.
[2025-08-20 15:43:29,959][__main__][INFO] - Starting iteration 485.
[2025-08-20 15:43:53,100][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:53,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:53,108][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:55,563][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:55,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:55,571][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 15:43:55,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:43:55,574][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:43:55,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32×, through 15:44:20,932)
[2025-08-20 15:44:22,520][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 15:44:23,473][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:44:23,474][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:44:24,992][__main__][INFO] - Iteration 486 took 55s (37.61% Gen, 62.39% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 43m 19s. Estimated total time: 15h 17m 13s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 36s.
[2025-08-20 15:44:24,994][__main__][INFO] - Starting iteration 486.
[2025-08-20 15:44:48,057][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:44:48,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:44:48,065][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:44:50,523][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:44:50,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:44:50,531][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:44:50,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:44:50,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:44:50,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 more identical "Accumulated the policy gradient loss for 8960 tokens" entries elided (32 micro-batches in total, one every ~0.8 s) ...]
[2025-08-20 15:45:15,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:45:17,538][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:45:18,478][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:45:18,480][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:45:19,832][__main__][INFO] - Iteration 487 took 54s (37.59% Gen, 62.40% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 39m 9s. Estimated total time: 15h 13m 58s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 23s, 500 more iterations: 7h 36m 59s.
[2025-08-20 15:45:19,834][__main__][INFO] - Starting iteration 487.
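The "For task: … ΔVRAM Allocated … ΔVRAM Reserved … ΔTime … Percentage of VRAM taken" records suggest each stage runs inside a small GPU-memory profiler. A sketch of how such a wrapper can be built from standard `torch.cuda` counters; the context-manager shape and the exact message format are assumptions, not the mllm implementation:

```python
import time
from contextlib import contextmanager
import torch

@contextmanager
def profile_task(name, logger):
    """Log VRAM and wall-clock deltas for one named task."""
    alloc0, reserved0 = torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
    t0 = time.monotonic()
    yield
    dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
    d_alloc = (torch.cuda.memory_allocated() - alloc0) / 2**30      # GiB
    d_reserved = (torch.cuda.memory_reserved() - reserved0) / 2**30
    pct = 100 * torch.cuda.memory_reserved() / torch.cuda.get_device_properties(0).total_memory
    logger.info(f"For task: {name} , ΔVRAM Allocated: {d_alloc} GB, "
                f"ΔVRAM Reserved: {d_reserved} GB, ΔTime: {dt}, "
                f"Percentage of VRAM taken: {pct}%,")
```

The ~2.3e-05 GB delta on "Apply reinforce step" is consistent with steady state: gradient and optimizer memory was already allocated in earlier iterations, so later steps reuse it.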
[2025-08-20 15:45:42,907][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:45:42,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:45:42,915][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:45:45,376][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:45:45,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:45:45,383][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:45:45,385][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:45:45,386][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:45:45,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 more identical "Accumulated the policy gradient loss for 8960 tokens" entries elided (32 micro-batches in total, one every ~0.8 s) ...]
[2025-08-20 15:46:10,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:46:12,485][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:46:13,415][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:46:13,417][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:46:14,752][__main__][INFO] - Iteration 488 took 54s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 39m 34s. Estimated total time: 15h 15m 17s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 38s.
[2025-08-20 15:46:14,753][__main__][INFO] - Starting iteration 488.
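The quantity being accumulated is a REINFORCE-style surrogate: advantage-weighted negative log-likelihood of the generated tokens. A minimal token-level version, assuming per-token advantages and a padding mask (tensor names illustrative, not the mllm API):

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(logits, target_ids, advantages, mask):
    """REINFORCE surrogate averaged over valid tokens.

    logits: (B, T, V); target_ids, advantages, mask: (B, T).
    """
    logprobs = F.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return -(token_lp * advantages * mask).sum() / mask.sum()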
[2025-08-20 15:46:38,193][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:46:38,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:46:38,200][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:46:40,660][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:46:40,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:46:40,667][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:46:40,669][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:46:40,670][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:46:40,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 more identical "Accumulated the policy gradient loss for 8960 tokens" entries elided (32 micro-batches in total, one every ~0.8 s) ...]
[2025-08-20 15:47:06,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:47:07,648][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:47:08,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:47:08,604][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:47:09,984][__main__][INFO] - Iteration 489 took 55s (37.99% Gen, 62.01% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 43m 51s. Estimated total time: 15h 20m 29s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 14s.
[2025-08-20 15:47:09,985][__main__][INFO] - Starting iteration 489.
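"Sharing advantage alignment data" followed immediately by "Receiving advantage packets" implies the two co-players' trainers swap their advantage estimates before the policy loss is formed. One plausible shape for that exchange, sketched with queue-like channels; the packet layout and function name are guesses, not the mllm protocol:

```python
import torch

def share_advantages(send_q, recv_q, own_adv: torch.Tensor, iteration: int):
    """Swap per-step advantage tensors with the co-player's trainer.

    send_q / recv_q: queue-like channels (e.g. multiprocessing.Queue).
    Hypothetical packet format: (iteration, CPU tensor of advantages).
    """
    send_q.put((iteration, own_adv.detach().cpu()))
    it, other_adv = recv_q.get()
    assert it == iteration, "trainers out of sync"
    return other_adv
```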
[2025-08-20 15:47:33,085][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:47:33,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:47:33,093][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:47:35,572][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:47:35,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:47:35,579][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:47:35,581][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:47:35,582][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:47:35,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 more identical "Accumulated the policy gradient loss for 8960 tokens" entries elided (32 micro-batches in total, one every ~0.8 s) ...]
[2025-08-20 15:48:00,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:48:02,616][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:48:03,540][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:48:03,541][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:48:04,946][__main__][INFO] - Iteration 490 took 54s (37.55% Gen, 62.45% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 38m 26s. Estimated total time: 15h 16m 0s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 0s.
[2025-08-20 15:48:04,948][__main__][INFO] - Starting iteration 490.
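The two "Saved … optimizer state" records at the end of every iteration are plain `state_dict` checkpointing, which is what makes a ~15-hour run resumable after preemption. A generic sketch; the function names are illustrative, only the two file names are taken from the log:

```python
import torch

def save_optimizers(policy_opt, critic_opt, out_dir):
    # One file per optimizer, overwritten each iteration, as in the log.
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

def load_optimizers(policy_opt, critic_opt, out_dir):
    # Restore optimizer moments (e.g. Adam state) when resuming a run.
    policy_opt.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
    critic_opt.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))
```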
[2025-08-20 15:48:27,995][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:48:27,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:48:28,002][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:48:30,466][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:48:30,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:48:30,473][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:48:30,476][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:48:30,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:48:30,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 more identical "Accumulated the policy gradient loss for 8960 tokens" entries elided (32 micro-batches in total, one every ~0.8 s) ...]
[2025-08-20 15:48:55,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:48:57,549][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:48:58,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:48:58,468][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:48:59,829][__main__][INFO] - Iteration 491 took 54s (37.51% Gen, 62.48% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 36m 12s. Estimated total time: 15h 14m 41s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 20s.
[2025-08-20 15:48:59,831][__main__][INFO] - Starting iteration 491.
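The iteration summaries project remaining time by linear extrapolation from the mean iteration duration so far. A sketch of that arithmetic (the helper name and report layout are assumptions):

```python
def eta_report(elapsed_s: float, done: int, total: int):
    """Linear projection: remaining ≈ mean iteration time × iterations left."""
    mean_it = elapsed_s / done
    fmt = lambda s: f"{int(s // 3600)}h {int(s % 3600 // 60)}m {int(s % 60)}s"
    report = {"remaining": fmt(mean_it * (total - done)),
              "total": fmt(mean_it * total)}
    for n in (10, 100, 500):                  # horizons printed in the log
        report[f"{n} more iterations"] = fmt(mean_it * n)
    return report
```

Sanity check against the log: at ~55 s per iteration, 500 more iterations is 500 × 55 s ≈ 7h 38m, matching the printed "500 more iterations" estimates.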
[2025-08-20 15:49:23,278][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:49:23,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:49:23,286][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:49:25,735][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:49:25,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:49:25,743][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:49:25,745][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:49:25,745][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:49:26,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 more identical "Accumulated the policy gradient loss for 8960 tokens" entries elided (32 micro-batches in total, one every ~0.8 s) ...]
[2025-08-20 15:49:51,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:49:52,679][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:49:53,602][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:49:53,603][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:49:54,974][__main__][INFO] - Iteration 492 took 55s (38.07% Gen, 61.92% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 39m 38s. Estimated total time: 15h 19m 2s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 31s.
[2025-08-20 15:49:54,975][__main__][INFO] - Starting iteration 492.
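The trainer module name (trainer_ad_align) and the advantage exchange point to an advantage-alignment-style opponent-shaping objective, where the policy-gradient weight couples the agent's own advantage with the co-player's. The exact combination rule is not visible in the log; the product-with-cumulative-sum form and the `beta` coefficient below are assumptions for illustration only:

```python
import torch

def aligned_advantage(a_self: torch.Tensor, a_other: torch.Tensor, beta: float = 1.0):
    """Illustrative opponent-shaping weight (assumed formula, not mllm's).

    a_self, a_other: (B, T) per-step advantages for agent and co-player.
    The coupling term up-weights actions that were also good for the co-player.
    """
    cum_other = torch.cumsum(a_other, dim=-1)   # co-player's advantage so far
    return a_self + beta * a_self * cum_other
```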
[2025-08-20 15:50:18,020][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:50:18,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:50:18,027][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:50:20,479][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:50:20,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:50:20,487][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:50:20,489][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:50:20,489][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:50:20,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 more identical "Accumulated the policy gradient loss for 8960 tokens" entries elided (32 micro-batches in total, one every ~0.8 s) ...]
[2025-08-20 15:50:45,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:50:47,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:50:48,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:50:48,552][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:50:49,904][__main__][INFO] - Iteration 493 took 54s (37.49% Gen, 62.51% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 35m 9s. Estimated total time: 15h 15m 28s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 44s.
[2025-08-20 15:50:49,905][__main__][INFO] - Starting iteration 493.
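"Get advantages with critic gradient accumulation" pairs advantage computation with accumulating the critic's regression gradients in the same pass. A compact sketch in which the advantage is the return minus the critic's value baseline and the critic is fit by MSE (all names illustrative):

```python
import torch
import torch.nn.functional as F

def advantages_with_critic_accum(critic, states, returns, accum_steps):
    """Return detached advantages; leave scaled critic gradients accumulated."""
    values = critic(states).squeeze(-1)                      # (B, T) value baseline
    (F.mse_loss(values, returns) / accum_steps).backward()   # critic grads add up
    return (returns - values).detach()                       # A = R - V
```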
[2025-08-20 15:51:13,260][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:51:13,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:51:13,268][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:51:15,706][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:51:15,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:51:15,714][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:51:15,716][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:51:15,717][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:51:16,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 more identical "Accumulated the policy gradient loss for 8960 tokens" entries elided (32 micro-batches in total, one every ~0.8 s) ...]
[2025-08-20 15:51:41,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:51:42,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:51:43,595][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:51:43,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:51:44,991][__main__][INFO] - Iteration 494 took 55s (37.94% Gen, 62.06% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 36m 51s. Estimated total time: 15h 18m 5s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 2s.
[2025-08-20 15:51:44,993][__main__][INFO] - Starting iteration 494.
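Since every iteration emits one summary record in a fixed format, throughput regressions are easy to monitor by parsing the log itself. A small parser for these summary lines (the regex is written against the format shown above; the function name is arbitrary):

```python
import re

SUMMARY = re.compile(
    r"Iteration (\d+) took (\d+)s \(([\d.]+)% Gen, ([\d.]+)% Train\)")

def parse_summaries(path):
    """Collect (iteration, seconds, gen_pct, train_pct) from summary lines."""
    rows = []
    with open(path) as f:
        for line in f:
            m = SUMMARY.search(line)
            if m:
                it, sec, gen, train = m.groups()
                rows.append((int(it), int(sec), float(gen), float(train)))
    return rows
```

Applied to this stretch of the run it would show a stable 54-55 s per iteration with a roughly 37.5/62.5 generation/training split.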
[2025-08-20 15:52:07,957][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:52:07,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:52:07,965][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:52:10,429][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:52:10,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:52:10,437][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:52:10,439][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:52:10,439][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:52:10,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 more identical "Accumulated the policy gradient loss for 8960 tokens" entries elided (32 micro-batches in total, one every ~0.8 s) ...]
[2025-08-20 15:52:35,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:52:37,501][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:52:38,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:52:38,467][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:52:40,141][__main__][INFO] - Iteration 495 took 55s (37.19% Gen, 62.81% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 36m 58s. Estimated total time: 15h 19m 7s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 33s.
[2025-08-20 15:52:40,142][__main__][INFO] - Starting iteration 495.
[2025-08-20 15:53:03,283][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:53:03,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:53:03,291][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:53:05,750][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:53:05,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:53:05,758][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:53:05,760][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:53:05,760][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:53:06,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:06,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:07,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:08,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:09,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:10,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:10,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:11,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:12,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:13,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:13,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:14,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:15,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:16,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:17,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:17,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:18,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:19,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:20,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:21,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:21,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:22,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:24,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:24,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:25,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:26,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:27,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:27,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:28,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:29,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:30,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:31,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:53:32,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:53:34,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:53:34,351][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:53:35,940][__main__][INFO] - Iteration 496 took 55s (37.08% Gen, 62.91% Train). Generation: 20s, Training: 35s. Estimated remaining time: 7h 46m 52s. Estimated total time: 15h 29m 57s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 59s, 500 more iterations: 7h 44m 58s.
[2025-08-20 15:53:35,941][__main__][INFO] - Starting iteration 496.
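Note: the "For task: ... , ΔVRAM Allocated: ..., ΔVRAM Reserved: ..., ΔTime: ..., Percentage of VRAM taken: ..." records read like the output of a context-manager profiler wrapped around each sub-task. The sketch below reproduces that record shape with PyTorch's public CUDA memory counters; the real trainer_ad_align helper is not visible in the log, so treat this as an assumption.

import time
from contextlib import contextmanager

import torch

@contextmanager
def track_task(name, logger):
    # Snapshot CUDA memory counters and wall time around a task.
    alloc_before = torch.cuda.memory_allocated()
    reserved_before = torch.cuda.memory_reserved()
    start = time.monotonic()
    yield
    delta_alloc = (torch.cuda.memory_allocated() - alloc_before) / 2**30
    delta_reserved = (torch.cuda.memory_reserved() - reserved_before) / 2**30
    elapsed = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - start))
    total = torch.cuda.get_device_properties(0).total_memory
    pct = 100.0 * torch.cuda.memory_reserved() / total
    logger.info("For task: %s , ΔVRAM Allocated: %s GB, ΔVRAM Reserved: %s GB, "
                "ΔTime: %s, Percentage of VRAM taken: %s%%,",
                name, delta_alloc, delta_reserved, elapsed, pct)

Wrapping a block in "with track_task('Create alternative trajectory batch', logger):" would emit one record of this shape; the repeated 0.0 GB deltas above indicate those sub-tasks reuse memory that is already reserved.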
[2025-08-20 15:53:58,988][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:53:58,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:53:58,996][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:01,430][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:01,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:01,438][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:01,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:54:01,441][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:54:01,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:02,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:03,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:04,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:04,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:05,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:06,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:07,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:08,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:08,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:09,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:10,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:11,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:12,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:12,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:13,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:14,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:15,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:16,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:17,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:18,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:18,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:19,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:20,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:21,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:21,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:22,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:23,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:24,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:25,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:25,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:26,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:28,449][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:54:29,365][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:54:29,366][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:54:30,727][__main__][INFO] - Iteration 497 took 54s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 29m 5s. Estimated total time: 15h 13m 4s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 18s, 500 more iterations: 7h 36m 32s.
[2025-08-20 15:54:30,728][__main__][INFO] - Starting iteration 497.
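Note: the paired "Saved main/critic optimizer state" records each iteration match the standard torch.save(optimizer.state_dict(), path) pattern, which keeps the run resumable after a preemption. A sketch under that assumption, with file names taken from the log; the helper itself is a hypothetical reconstruction.

import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir, logger):
    # Persist both optimizers every iteration so an interrupted job can
    # resume via optimizer.load_state_dict(torch.load(path)).
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)

    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)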
[2025-08-20 15:54:53,725][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:53,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:53,732][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:56,190][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:56,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:56,198][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:54:56,200][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:54:56,200][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:54:56,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:57,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:58,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:58,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:54:59,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:00,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:01,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:02,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:02,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:03,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:04,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:05,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:06,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:06,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:07,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:08,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:09,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:10,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:11,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:12,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:12,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:13,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:14,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:15,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:16,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:16,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:17,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:18,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:19,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:20,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:20,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:21,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:23,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:55:24,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:55:24,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:55:25,474][__main__][INFO] - Iteration 498 took 54s (37.49% Gen, 62.50% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 27m 31s. Estimated total time: 15h 12m 25s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 14s, 500 more iterations: 7h 36m 12s.
[2025-08-20 15:55:25,476][__main__][INFO] - Starting iteration 498.
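Note: the "Estimated remaining time ... Time estimates for 10/100/500 more iterations" figures are linear extrapolations from the measured per-iteration wall time, which is why they drift by a few minutes between iterations. A minimal reconstruction; total_iterations is an assumed configuration value the log never states, and the real formatter prints "1h 31m 14s" where timedelta prints "1:31:14".

from datetime import timedelta

def eta_report(iteration_seconds, iterations_done, total_iterations):
    # Linear extrapolation: remaining iterations times current per-iteration cost.
    def fmt(seconds):
        return str(timedelta(seconds=round(seconds)))
    remaining = (total_iterations - iterations_done) * iteration_seconds
    parts = [f"Estimated remaining time: {fmt(remaining)}"]
    parts += [f"{n} more iterations: {fmt(n * iteration_seconds)}"
              for n in (10, 100, 500)]
    return ". ".join(parts)

With iteration_seconds = 54.744, the 10/100/500 projections come out to 0:09:07, 1:31:14 and 7:36:12, reproducing the trio logged for iteration 498 above.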
[2025-08-20 15:55:49,088][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:55:49,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:55:49,095][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:55:51,543][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:55:51,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:55:51,551][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:55:51,553][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:55:51,553][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:55:51,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:52,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:53,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:54,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:55,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:55,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:56,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:57,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:58,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:58,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:55:59,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:00,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:01,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:02,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:02,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:03,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:04,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:05,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:06,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:07,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:08,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:09,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:09,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:10,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:11,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:12,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:12,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:13,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:14,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:15,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:16,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:16,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:18,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:56:19,528][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:56:19,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:56:20,900][__main__][INFO] - Iteration 499 took 55s (38.19% Gen, 61.80% Train). Generation: 21s, Training: 34s. Estimated remaining time: 7h 37m 54s. Estimated total time: 15h 23m 44s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 52s.
[2025-08-20 15:56:20,902][__main__][INFO] - Starting iteration 499.
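Note: "Sharing advantage alignment data." immediately followed by "Receiving advantage packets." suggests each agent's trainer publishes its own advantage estimates and then blocks until the opposing trainer's arrive, so both sides can align their updates on the same data. The transport is not visible in the log; the queue pair below is a hypothetical in-process stand-in, not the project's actual mechanism.

import queue

class AdvantageChannel:
    """Symmetric mailbox between two trainers (hypothetical transport)."""

    def __init__(self, outbox: queue.Queue, inbox: queue.Queue):
        self.outbox = outbox  # written by this trainer, read by the peer
        self.inbox = inbox    # written by the peer, read by this trainer

    def exchange(self, advantages, logger):
        logger.info("Sharing advantage alignment data.")
        self.outbox.put(advantages)
        logger.info("Receiving advantage packets.")
        return self.inbox.get()  # blocks until the peer has shared

Putting before getting on both sides avoids deadlock: two trainers built from the same queue pair in opposite order can call exchange() concurrently.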
[2025-08-20 15:56:46,391][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:56:46,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:56:46,399][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:56:48,844][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:56:48,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:56:48,852][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:56:48,854][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:56:48,855][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:56:49,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:49,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:50,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:51,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:52,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:53,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:53,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:54,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:55,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:56,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:57,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:57,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:58,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:56:59,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:00,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:01,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:01,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:02,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:03,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:04,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:05,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:06,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:07,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:07,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:08,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:09,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:10,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:10,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:11,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:12,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:13,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:14,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:15,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:57:16,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:57:16,771][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:57:18,104][__main__][INFO] - Iteration 500 took 57s (40.29% Gen, 59.71% Train). Generation: 23s, Training: 34s. Estimated remaining time: 8h 6m 34s. Estimated total time: 15h 53m 21s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 20s, 500 more iterations: 7h 56m 40s.
[2025-08-20 15:57:18,106][__main__][INFO] - Starting iteration 500.
[2025-08-20 15:57:41,170][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:57:41,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:57:41,178][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:57:43,629][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:57:43,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:57:43,637][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:57:43,639][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:57:43,640][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:57:43,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:44,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:45,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:46,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:47,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:47,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:48,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:49,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:50,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:51,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:51,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:52,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:53,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:54,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:55,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:55,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:56,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:57,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:58,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:59,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:57:59,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:00,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:01,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:02,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:03,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:04,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:05,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:05,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:06,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:07,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:08,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:09,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:10,669][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:58:11,609][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:58:11,610][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:58:15,671][__main__][INFO] - Iteration 501 took 57s (35.84% Gen, 59.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8h 11m 40s. Estimated total time: 15h 59m 24s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 56s, 500 more iterations: 7h 59m 42s.
[2025-08-20 15:58:15,672][__main__][INFO] - Starting iteration 501.
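Note: the "(X% Gen, Y% Train)" breakdown is each phase's seconds over the iteration total, apparently computed from unrounded seconds, which is why 23 s / 57 s prints as 40.29% rather than 40.35%. Iteration 501 above is the one case where the shares sum to only 95.31%; the likeliest reading is that roughly 2.7 s of overhead (note the 4 s gap before its summary record) fell outside both phases, but that is an inference, not something the log states. The arithmetic itself:

def phase_split(generation_s, training_s, total_s):
    # Share of iteration wall time spent in each phase, in percent.
    return (100.0 * generation_s / total_s,
            100.0 * training_s / total_s)

# phase_split(23, 34, 57) -> (40.35..., 59.64...), versus the logged
# "40.29% Gen, 59.71% Train" computed before the seconds were rounded.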
[2025-08-20 15:58:38,676][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:58:38,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:58:38,684][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:58:41,163][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:58:41,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:58:41,171][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:58:41,173][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:58:41,173][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:58:41,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:42,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:43,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:43,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:44,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:45,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:46,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:47,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:47,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:48,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:49,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:50,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:50,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:51,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:52,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:53,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:54,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:54,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:55,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:56,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:57,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:58,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:58:59,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:00,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:01,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:01,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:02,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:03,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:04,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:05,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:05,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:06,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:08,290][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 15:59:09,220][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 15:59:09,222][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 15:59:10,485][__main__][INFO] - Iteration 502 took 54s (37.45% Gen, 62.55% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 24m 52s. Estimated total time: 15h 13m 32s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 21s, 500 more iterations: 7h 36m 46s.
[2025-08-20 15:59:10,486][__main__][INFO] - Starting iteration 502.
[2025-08-20 15:59:33,640][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:59:33,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:59:33,647][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:59:36,083][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:59:36,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:59:36,090][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 15:59:36,093][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 15:59:36,093][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 15:59:36,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:37,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:37,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:38,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:39,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:40,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:41,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:41,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:42,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:43,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:44,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:45,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:45,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:46,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:47,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:48,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:49,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:50,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:51,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:51,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:52,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:53,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:54,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:55,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:55,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:56,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:57,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:58,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:59,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 15:59:59,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:00,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:01,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:03,109][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:00:04,008][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:00:04,009][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:00:05,920][__main__][INFO] - Iteration 503 took 55s (37.36% Gen, 62.64% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 34m 18s. Estimated total time: 15h 23m 53s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 56s.
[2025-08-20 16:00:05,921][__main__][INFO] - Starting iteration 503.
[2025-08-20 16:00:29,353][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:00:29,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:00:29,361][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:00:31,810][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:00:31,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:00:31,817][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:00:31,820][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:00:31,820][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:00:32,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:32,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:33,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:34,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:35,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:36,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:36,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:37,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:38,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:39,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:40,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:40,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:41,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:42,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:43,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:44,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:44,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:45,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:46,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:47,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:47,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:48,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:49,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:50,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:51,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:51,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:53,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:54,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:54,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:55,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:56,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:57,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:00:58,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:00:59,773][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:00:59,775][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:01:01,475][__main__][INFO] - Iteration 504 took 55s (37.78% Gen, 62.22% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 35m 22s. Estimated total time: 15h 25m 52s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 35s, 500 more iterations: 7h 42m 56s.
[2025-08-20 16:01:01,476][__main__][INFO] - Starting iteration 504.
[2025-08-20 16:01:24,427][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:01:24,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:01:24,435][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:01:26,892][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:01:26,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:01:26,900][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:01:26,901][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:01:26,902][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:01:27,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 identical records omitted, one per micro-batch ...]
[2025-08-20 16:01:52,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:01:53,957][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:01:54,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:01:54,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:01:56,162][__main__][INFO] - Iteration 505 took 54s (37.50% Gen, 62.50% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 19m 59s. Estimated total time: 15h 11m 25s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 8s, 500 more iterations: 7h 35m 42s.
[2025-08-20 16:01:56,163][__main__][INFO] - Starting iteration 505.
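"Sharing advantage alignment data." / "Receiving advantage packets." mark the hand-off where this trainer publishes its per-step advantage estimates and collects its counterpart's, so the subsequent policy-gradient pass can weight actions by both players' advantages. The transport is not visible in the log; the sketch below assumes a simple in-process queue pair, and the names share_advantages / receive_advantages are hypothetical:

    from queue import Queue

    import torch

    # One queue per direction; in a multi-process run these would be whatever
    # IPC the trainers actually use (assumption).
    outbox: Queue = Queue()
    inbox: Queue = Queue()

    def share_advantages(advantages: torch.Tensor) -> None:
        """Publish detached advantage estimates for the peer trainer."""
        outbox.put(advantages.detach().cpu())

    def receive_advantages() -> torch.Tensor:
        """Block until the peer's advantage packet arrives."""
        return inbox.get()

How the received advantages are then combined with the local ones (the alignment objective itself) is not recoverable from this log.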
[2025-08-20 16:02:19,259][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:02:19,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:02:19,267][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:02:21,708][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:02:21,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:02:21,716][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:02:21,718][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:02:21,718][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:02:22,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 identical records omitted, one per micro-batch ...]
[2025-08-20 16:02:47,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:02:48,715][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:02:49,642][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:02:49,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:02:51,017][__main__][INFO] - Iteration 506 took 54s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 21m 53s. Estimated total time: 15h 14m 13s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 25s, 500 more iterations: 7h 37m 6s.
[2025-08-20 16:02:51,018][__main__][INFO] - Starting iteration 506.
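Each "Accumulated the policy gradient loss for 8960 tokens." record is one micro-batch whose scaled loss is back-propagated before the single parameter update; with 32 records per iteration, each update averages over 32 × 8960 = 286,720 tokens. A minimal sketch of the pattern (the model interface, micro-batch layout, and REINFORCE surrogate are assumptions; only the accumulate-then-step structure is taken from the log):

    import logging

    import torch

    logger = logging.getLogger("mllm.training.trainer_common")

    def accumulate_policy_gradient(model, micro_batches, optimizer):
        """Sum scaled REINFORCE gradients over all micro-batches, then step once."""
        optimizer.zero_grad(set_to_none=True)
        n = len(micro_batches)  # 32 per iteration in this run
        for tokens, actions, advantages in micro_batches:
            logp = torch.log_softmax(model(tokens), dim=-1)
            logp_taken = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
            loss = -(advantages * logp_taken).mean()  # REINFORCE surrogate
            (loss / n).backward()                     # gradients accumulate in .grad
            logger.info("Accumulated the policy gradient loss for %d tokens.",
                        actions.numel())
        optimizer.step()  # logged separately as "Apply reinforce step"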
[2025-08-20 16:03:14,049][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:03:14,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:03:14,057][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:03:16,512][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:03:16,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:03:16,520][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:03:16,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:03:16,523][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:03:16,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 identical records omitted, one per micro-batch ...]
[2025-08-20 16:03:41,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:03:43,584][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:03:44,598][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:03:44,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:03:45,949][__main__][INFO] - Iteration 507 took 54s (37.47% Gen, 62.53% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 22m 15s. Estimated total time: 15h 15m 30s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 45s.
[2025-08-20 16:03:45,951][__main__][INFO] - Starting iteration 507.
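Throughout this stretch, "Apply reinforce step" reports the same ΔVRAM Allocated of 2.288818359375e-05 GB every iteration. That value is exactly 24 KiB, consistent with the optimizer's state buffers already existing, so the update itself allocates essentially nothing new:

    # The logged delta converts to a round number of bytes:
    delta_gb = 2.288818359375e-05
    print(delta_gb * 2**30)  # 24576.0 bytes = 24 KiB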
[2025-08-20 16:04:08,942][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:04:08,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:04:08,950][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:04:11,414][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:04:11,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:04:11,421][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:04:11,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:04:11,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:04:11,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 identical records omitted, one per micro-batch ...]
[2025-08-20 16:04:36,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:04:38,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:04:39,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:04:39,527][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:04:40,904][__main__][INFO] - Iteration 508 took 54s (37.38% Gen, 62.62% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 21m 42s. Estimated total time: 15h 15m 52s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 56s.
[2025-08-20 16:04:40,905][__main__][INFO] - Starting iteration 508.
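The paired "Saved main/critic optimizer state" records persist both optimizers' state dicts under the run's agent_trainer directory every iteration, so a killed job can resume without losing optimizer statistics (e.g. Adam moments). A plausible sketch using torch.save (the helper name and signature are assumptions):

    import logging

    import torch

    logger = logging.getLogger("mllm.training.trainer_common")

    def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir: str) -> None:
        """Checkpoint both optimizers so a restart can resume mid-run."""
        policy_path = f"{out_dir}/policy_optimizer_state.pt"
        torch.save(policy_optimizer.state_dict(), policy_path)
        logger.info("Saved main optimizer state to %s", policy_path)

        critic_path = f"{out_dir}/critic_optimizer_state.pt"
        torch.save(critic_optimizer.state_dict(), critic_path)
        logger.info("Saved critic optimizer state to %s", critic_path)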
[2025-08-20 16:05:04,178][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:05:04,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:05:04,186][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:05:06,649][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:05:06,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:05:06,657][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:05:06,659][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:05:06,660][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:05:06,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 identical records omitted, one per micro-batch ...]
[2025-08-20 16:05:32,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:05:33,710][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:05:35,730][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:05:35,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:05:37,512][__main__][INFO] - Iteration 509 took 56s (36.77% Gen, 63.22% Train). Generation: 20s, Training: 35s. Estimated remaining time: 7h 48m 19s. Estimated total time: 15h 43m 26s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 20s, 500 more iterations: 7h 51m 43s.
[2025-08-20 16:05:37,514][__main__][INFO] - Starting iteration 509.
[2025-08-20 16:06:00,552][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:06:00,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:06:00,560][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:06:03,011][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:06:03,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:06:03,019][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:06:03,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:06:03,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:06:03,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 identical records omitted, one per micro-batch ...]
[2025-08-20 16:06:28,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:06:30,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:06:38,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:06:38,919][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:06:40,424][__main__][INFO] - Iteration 510 took 1m 2s (32.73% Gen, 67.27% Train). Generation: 20s, Training: 42s. Estimated remaining time: 9h 32m 21s. Estimated total time: 17h 28m 30s. Time estimates for 10 more iterations: 10m 29s, 100 more iterations: 1h 44m 51s, 500 more iterations: 8h 44m 15s.
[2025-08-20 16:06:40,426][__main__][INFO] - Starting iteration 510.
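Iteration 510 is the one outlier in this stretch: the optimizer-state save was logged about 8.9 s after the reinforce step (16:06:30,062 to 16:06:38,917) instead of the usual fraction of a second, stretching training to 42 s and the iteration to 1m 2s. Note how the "N more iterations" projections react: to within rounding they are the current iteration's duration multiplied by N (62.91 s × 500 gives the 8h 44m 15s here, versus 54.952 s × 500 = 7h 37m 56s for iteration 508), so a single slow iteration swings the long-range figure by over an hour until the next normal one pulls it back. A small sketch of that extrapolation (helper name assumed):

    from datetime import timedelta

    def projected(iter_seconds: float, n_more: int) -> timedelta:
        """Naive projection: assume the next n_more iterations match the last one."""
        return timedelta(seconds=round(iter_seconds * n_more))

    print(projected(62.91, 10))    # 0:10:29 -> iteration 510's "10 more iterations"
    print(projected(62.91, 500))   # 8:44:15 -> its inflated 500-iteration figure
    print(projected(54.952, 500))  # 7:37:56 -> a typical ~55 s iteration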
[2025-08-20 16:07:03,230][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:07:03,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:07:03,237][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:07:05,696][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:07:05,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:07:05,704][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:07:05,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:07:05,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:07:06,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 identical records omitted, one per micro-batch ...]
[2025-08-20 16:07:31,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:07:32,724][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:07:33,644][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:07:33,645][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:07:34,913][__main__][INFO] - Iteration 511 took 54s (37.33% Gen, 62.66% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 11m 3s. Estimated total time: 15h 8m 6s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 48s, 500 more iterations: 7h 34m 3s.
[2025-08-20 16:07:34,915][__main__][INFO] - Starting iteration 511.
[2025-08-20 16:07:57,972][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:07:57,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:07:57,980][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:00,468][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:00,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:00,476][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:00,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:08:00,479][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:08:00,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 identical records omitted, one per micro-batch ...]
[2025-08-20 16:08:25,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:08:27,557][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:08:28,485][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:08:28,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:08:29,897][__main__][INFO] - Iteration 512 took 54s (37.44% Gen, 62.56% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 18m 22s. Estimated total time: 15h 16m 21s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 10s.
[2025-08-20 16:08:29,898][__main__][INFO] - Starting iteration 512.
[2025-08-20 16:08:53,002][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:53,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:53,010][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:55,485][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:55,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:55,492][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:08:55,494][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:08:55,495][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:08:55,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [... 30 identical records omitted, one per micro-batch ...]
[2025-08-20 16:09:20,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:22,512][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:09:23,415][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:09:23,417][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:09:24,772][__main__][INFO] - Iteration 513 took 54s (37.63% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 15m 39s. Estimated total time: 15h 14m 33s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 16s.
[2025-08-20 16:09:24,773][__main__][INFO] - Starting iteration 513.
[2025-08-20 16:09:47,868][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:09:47,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:09:47,886][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:09:50,367][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:09:50,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:09:50,375][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:09:50,377][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:09:50,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:09:50,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:51,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:52,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:53,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:53,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:54,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:55,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:56,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:57,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:57,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:58,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:09:59,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:00,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:01,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:01,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:02,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:03,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:04,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:05,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:05,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:06,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:07,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:08,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:09,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:10,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:11,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:11,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:12,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:13,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:14,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:15,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:15,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:17,512][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:10:18,461][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:10:18,462][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:10:20,886][__main__][INFO] - Iteration 514 took 56s (36.77% Gen, 63.23% Train). Generation: 20s, Training: 35s. Estimated remaining time: 7h 35m 22s. Estimated total time: 15h 35m 12s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 31s, 500 more iterations: 7h 47m 36s.
[2025-08-20 16:10:20,887][__main__][INFO] - Starting iteration 514.
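
Note on the accumulation cycle: each "Apply reinforce step" record closes a run of 32 "Accumulated the policy gradient loss for 8960 tokens" microbatches, i.e. gradients are accumulated across microbatches and a single optimizer step follows. A sketch of that pattern, assuming an advantage-weighted REINFORCE surrogate masked to action tokens; the field names and per-token normalization are assumptions, not the trainer's actual code:

    import torch

    def reinforce_step(policy, optimizer, microbatches, accum_steps=32):
        """Accumulate the policy-gradient loss over `accum_steps` microbatches
        (8960 action tokens each in this run), then take one optimizer step."""
        optimizer.zero_grad(set_to_none=True)
        for batch in microbatches:  # len(microbatches) == accum_steps
            logp = torch.log_softmax(policy(batch["input_ids"]), dim=-1)
            token_logp = logp.gather(-1, batch["labels"].unsqueeze(-1)).squeeze(-1)
            mask = batch["action_mask"]  # 1 on generated (action) tokens only
            # REINFORCE surrogate: advantage-weighted negative log-likelihood.
            loss = -(token_logp * batch["advantages"] * mask).sum() / mask.sum()
            (loss / accum_steps).backward()  # scale so gradients average, not sum
            print(f"Accumulated the policy gradient loss for {int(mask.sum())} tokens.")
        optimizer.step()

The constant 8960-token count per microbatch suggests fixed-size packed sequences, so dividing by `accum_steps` and by `mask.sum()` gives the same weighting either way.
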
[2025-08-20 16:10:44,095][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:10:44,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:10:44,103][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:10:46,540][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:10:46,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:10:46,548][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:10:46,550][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:10:46,551][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:10:46,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:47,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:48,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:49,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:50,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:50,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:51,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:52,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:53,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:53,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:54,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:55,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:56,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:57,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:57,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:58,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:10:59,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:00,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:01,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:02,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:03,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:04,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:04,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:05,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:06,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:07,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:07,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:08,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:09,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:10,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:11,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:11,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:13,628][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:11:14,569][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:11:14,570][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:11:15,921][__main__][INFO] - Iteration 515 took 55s (37.75% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 16m 28s. Estimated total time: 15h 17m 13s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 36s.
[2025-08-20 16:11:15,923][__main__][INFO] - Starting iteration 515.
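
Note on the timing summaries: the "N more iterations" figures are simple projections, the just-measured iteration duration times N (e.g. 54.8 s x 10 ≈ 9m 8s for iteration 513), while the remaining/total estimates extrapolate over the iterations still to run. A sketch of that arithmetic; the total iteration count is a parameter here because it does not appear in this excerpt:

    def eta_report(iter_seconds, iters_done, iters_total):
        """Project wall time from the duration of the last iteration."""
        def fmt(seconds):
            s = int(seconds)
            return f"{s // 3600}h {s % 3600 // 60}m {s % 60}s"
        for n in (10, 100, 500):
            print(f"{n} more iterations: {fmt(n * iter_seconds)}")
        print(f"Estimated remaining time: {fmt((iters_total - iters_done) * iter_seconds)}")

    eta_report(iter_seconds=54.8, iters_done=513, iters_total=1000)
    # -> 10 more iterations: 0h 9m 8s
    #    100 more iterations: 1h 31m 20s
    #    500 more iterations: 7h 36m 40s
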
[2025-08-20 16:11:38,933][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:11:38,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:11:38,941][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:11:41,382][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:11:41,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:11:41,390][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:11:41,392][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:11:41,393][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:11:41,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:42,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:43,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:44,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:44,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:45,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:46,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:47,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:48,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:48,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:49,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:50,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:51,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:52,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:52,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:53,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:54,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:55,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:55,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:56,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:57,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:58,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:59,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:11:59,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:00,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:01,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:02,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:03,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:04,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:05,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:06,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:06,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:08,532][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:12:09,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:12:09,459][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:12:11,175][__main__][INFO] - Iteration 516 took 55s (37.24% Gen, 62.76% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 19m 11s. Estimated total time: 15h 20m 51s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 5s, 500 more iterations: 7h 40m 25s.
[2025-08-20 16:12:11,176][__main__][INFO] - Starting iteration 516.
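
Note on "Sharing advantage alignment data" / "Receiving advantage packets": these records mark the point where the two agents' trainers swap per-token advantage estimates so each side can form its opponent-shaping term. One way such an exchange can look with torch.distributed point-to-point ops; the ranks, tensor shapes, and packet layout are assumptions, not the trainer's actual wire format:

    import torch
    import torch.distributed as dist

    def exchange_advantage_packet(my_adv: torch.Tensor, peer_rank: int) -> torch.Tensor:
        """Send our advantage tensor to the peer trainer and receive theirs.
        Assumes both processes hold same-shaped [batch, turns] tensors."""
        peer_adv = torch.empty_like(my_adv)
        handle = dist.isend(my_adv, dst=peer_rank)  # "Sharing advantage alignment data."
        dist.recv(peer_adv, src=peer_rank)          # "Receiving advantage packets."
        handle.wait()
        return peer_adv

The non-blocking send before the blocking receive lets two symmetric trainers exchange packets without deadlocking.
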
[2025-08-20 16:12:33,994][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:12:33,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:12:34,002][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:12:36,460][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:12:36,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:12:36,467][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:12:36,470][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:12:36,470][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:12:36,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:37,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:38,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:39,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:39,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:40,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:41,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:42,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:43,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:43,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:44,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:45,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:46,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:47,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:47,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:48,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:49,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:50,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:51,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:51,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:52,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:54,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:54,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:55,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:56,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:57,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:58,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:58,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:12:59,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:00,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:01,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:02,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:03,672][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:13:04,620][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:13:04,621][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:13:05,906][__main__][INFO] - Iteration 517 took 54s (37.23% Gen, 62.76% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 9m 34s. Estimated total time: 15h 12m 9s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 12s, 500 more iterations: 7h 36m 4s.
[2025-08-20 16:13:05,908][__main__][INFO] - Starting iteration 517.
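
Note on the "Saved ... optimizer state" records: both optimizers' states are re-saved at the end of every iteration so a preempted job can resume without losing optimizer moments. The pattern is a plain torch.save of each state_dict; the tmp-then-rename step below is an extra safeguard the log does not show, added here as an assumption:

    import os
    import torch

    def save_optimizer_state(optimizer, path):
        """Persist optimizer.state_dict(); write to a temp file first so a crash
        mid-write cannot corrupt the previous checkpoint."""
        tmp = path + ".tmp"
        torch.save(optimizer.state_dict(), tmp)
        os.replace(tmp, path)  # atomic rename on POSIX filesystems

    # e.g. save_optimizer_state(policy_opt, ".../agent_trainer/policy_optimizer_state.pt")
    #      save_optimizer_state(critic_opt, ".../agent_trainer/critic_optimizer_state.pt")
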
[2025-08-20 16:13:28,765][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:13:28,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:13:28,773][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:13:31,212][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:13:31,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:13:31,220][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:13:31,222][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:13:31,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:13:31,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:32,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:33,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:33,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:34,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:35,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:36,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:37,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:37,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:38,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:39,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:40,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:41,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:41,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:42,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:43,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:44,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:45,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:46,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:47,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:47,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:48,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:49,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:50,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:51,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:51,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:52,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:53,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:54,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:55,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:55,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:56,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:13:58,258][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:13:59,190][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:13:59,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:14:00,595][__main__][INFO] - Iteration 518 took 54s (37.33% Gen, 62.66% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 7m 56s. Estimated total time: 15h 11m 26s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 8s, 500 more iterations: 7h 35m 43s.
[2025-08-20 16:14:00,596][__main__][INFO] - Starting iteration 518.
[2025-08-20 16:14:23,711][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:14:23,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:14:23,718][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:14:26,193][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:14:26,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:14:26,201][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:14:26,203][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:14:26,204][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:14:26,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:27,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:28,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:28,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:29,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:30,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:31,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:32,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:32,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:33,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:34,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:35,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:36,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:36,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:37,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:38,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:39,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:40,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:40,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:41,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:42,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:43,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:43,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:45,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:46,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:46,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:47,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:48,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:49,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:50,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:50,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:51,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:14:53,210][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:14:54,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:14:54,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:14:55,597][__main__][INFO] - Iteration 519 took 55s (37.53% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 12m 15s. Estimated total time: 15h 16m 40s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 20s.
[2025-08-20 16:14:55,598][__main__][INFO] - Starting iteration 519.
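
Note on "Create alternative trajectory batch" / "Compute alternative advantage estimates": besides the advantages of the sampled trajectory, the trainer scores counterfactual variants of it with the critic. A rough sketch under that reading; the one-step advantage form and the re-simulated inputs are assumptions about the trainer's internals, not code from the log:

    import torch

    @torch.no_grad()
    def alternative_advantages(critic, obs, alt_rewards, alt_next_obs, gamma=0.99):
        """One-step advantage estimates A(s, a') = r' + gamma * V(s'_alt) - V(s)
        for alternative actions whose rewards and next states were re-simulated
        when the alternative trajectory batch was built."""
        return alt_rewards + gamma * critic(alt_next_obs) - critic(obs)
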
[2025-08-20 16:15:18,783][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:15:18,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:15:18,791][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:15:21,236][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:15:21,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:15:21,244][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:15:21,246][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:15:21,246][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:15:21,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:22,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:23,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:23,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:24,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:25,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:26,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:27,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:27,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:28,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:29,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:30,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:31,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:31,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:32,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:33,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:34,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:35,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:35,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:36,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:38,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:38,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:39,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:40,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:41,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:42,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:42,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:43,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:44,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:45,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:45,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:46,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:15:48,426][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:15:49,361][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:15:49,362][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:15:50,642][__main__][INFO] - Iteration 520 took 55s (37.69% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 12m 3s. Estimated total time: 15h 17m 23s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 44s, 500 more iterations: 7h 38m 41s.
[2025-08-20 16:15:50,644][__main__][INFO] - Starting iteration 520.
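
Note on "Get advantages with critic gradient accumulation": the wording suggests the critic's regression loss is backpropagated in the same pass that produces the advantages, so the value-function update is accumulated alongside the policy microbatches and applied with its own optimizer step. A sketch of that coupling; the TD target and the accumulation count are assumptions:

    import torch
    import torch.nn.functional as F

    def advantages_with_critic_accum(critic, obs, returns, accum_steps=32):
        """Return detached advantages (returns - V(s)) while accumulating the
        critic's MSE loss gradient; critic_optimizer.step() runs once per cycle."""
        values = critic(obs)                      # differentiable value estimates
        advantages = (returns - values).detach()  # advantages carry no critic grad
        (F.mse_loss(values, returns) / accum_steps).backward()
        return advantages

Detaching the advantages keeps the policy loss from pushing gradients into the critic, while the accumulated MSE term trains the critic on the same forward pass.
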
[2025-08-20 16:16:13,595][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:16:13,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:16:13,602][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:16:16,044][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:16:16,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:16:16,052][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:16:16,054][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:16:16,055][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:16:16,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:17,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:17,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:18,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:19,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:20,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:21,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:21,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:22,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:23,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:24,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:25,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:25,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:26,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:27,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:28,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:29,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:29,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:30,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:31,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:32,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:33,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:34,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:35,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:35,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:36,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:37,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:38,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:39,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:39,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:40,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:41,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:16:43,091][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:16:44,035][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:16:44,036][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:16:45,471][__main__][INFO] - Iteration 521 took 54s (37.39% Gen, 62.61% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 7m 32s. Estimated total time: 15h 13m 47s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 22s, 500 more iterations: 7h 36m 53s.
[2025-08-20 16:16:45,473][__main__][INFO] - Starting iteration 521.
[2025-08-20 16:17:08,336][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:17:08,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:17:08,344][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:17:10,776][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:17:10,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:17:10,784][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:17:10,786][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:17:10,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:17:11,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:11,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:12,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:13,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:14,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:15,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:15,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:16,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:17,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:18,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:19,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:19,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:20,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:21,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:22,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:22,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:23,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:24,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:25,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:26,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:26,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:27,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:28,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:29,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:30,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:31,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:32,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:33,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:33,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:34,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:35,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:36,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:17:37,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:17:38,799][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:17:38,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:17:40,196][__main__][INFO] - Iteration 522 took 54s (37.31% Gen, 62.69% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 4m 53s. Estimated total time: 15h 12m 2s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 12s, 500 more iterations: 7h 36m 1s.
[2025-08-20 16:17:40,197][__main__][INFO] - Starting iteration 522.
[2025-08-20 16:18:03,071][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:18:03,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:18:03,078][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:18:05,508][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:18:05,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:18:05,516][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:18:05,518][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:18:05,519][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:18:05,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more "Accumulated the policy gradient loss for 8960 tokens." entries, logged ~0.8 s apart ...]
[2025-08-20 16:18:30,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:18:32,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:18:33,512][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:18:33,513][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:18:35,001][__main__][INFO] - Iteration 523 took 54s (37.27% Gen, 62.73% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 5m 18s. Estimated total time: 15h 13m 22s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 41s.
[2025-08-20 16:18:35,002][__main__][INFO] - Starting iteration 523.
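Each "Accumulated the policy gradient loss for 8960 tokens." line corresponds to one micro-batch forward/backward whose gradients sum in place; only after all 32 micro-batches does the reinforce step run. A hypothetical sketch of that accumulation loop; the per-token REINFORCE surrogate and the HF-style causal-LM output are assumptions, not read from the mllm code:

import torch

def accumulate_policy_gradient(model, micro_batches, logger):
    # Gradients from each micro-batch add up in .grad until optimizer.step().
    for input_ids, actions, advantages, mask in micro_batches:
        logits = model(input_ids).logits
        logprobs = torch.log_softmax(logits, dim=-1)
        taken = logprobs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        n_tokens = int(mask.sum())
        # REINFORCE surrogate: -E[advantage * log pi(token)], masked to real tokens.
        loss = -(advantages * taken * mask).sum() / max(n_tokens, 1)
        loss.backward()
        logger.info("Accumulated the policy gradient loss for %d tokens.", n_tokens)

With 32 micro-batches of 8960 tokens each, one update covers roughly 287k tokens; the perfectly constant count suggests padded, fixed-shape batches.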
[2025-08-20 16:18:58,161][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:18:58,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:18:58,169][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:00,615][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:00,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:00,623][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:00,625][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:19:00,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:19:00,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more "Accumulated the policy gradient loss for 8960 tokens." entries, logged ~0.8 s apart ...]
[2025-08-20 16:19:26,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:19:27,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:19:28,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:19:28,552][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:19:29,925][__main__][INFO] - Iteration 524 took 54s (37.71% Gen, 62.29% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 6m 23s. Estimated total time: 15h 15m 22s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 41s.
[2025-08-20 16:19:29,926][__main__][INFO] - Starting iteration 524.
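The "Apply reinforce step" task then consumes the accumulated gradients in a single optimizer update. Its allocation delta of 2.288818359375e-05 GB (24 KiB) indicates the optimizer's state tensors already exist by this point, so stepping allocates almost nothing new. A minimal sketch; the gradient clipping and its threshold are assumptions:

import torch

def apply_reinforce_step(model, optimizer, max_grad_norm=1.0):
    # Clip the summed micro-batch gradients, update once, then clear
    # the gradient buffers for the next accumulation round.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)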
[2025-08-20 16:19:52,881][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:52,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:52,889][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:55,340][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:55,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:55,348][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:19:55,350][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:19:55,351][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:19:55,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more "Accumulated the policy gradient loss for 8960 tokens." entries, logged ~0.8 s apart ...]
[2025-08-20 16:20:20,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:20:22,412][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:20:23,331][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:20:23,332][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:20:24,840][__main__][INFO] - Iteration 525 took 54s (37.34% Gen, 62.66% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 5m 19s. Estimated total time: 15h 15m 12s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 36s.
[2025-08-20 16:20:24,841][__main__][INFO] - Starting iteration 525.
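The two "Saved … optimizer state" lines overwrite the same .pt files under agent_trainer/ every iteration, so a preempted job can resume from the most recent update. A sketch of the save/load round-trip with plain torch.save / torch.load; the wrapper functions are illustrative, only the file names are taken from the log:

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir):
    # Overwrite in place each iteration; only the latest state is kept.
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

def load_optimizer_states(policy_opt, critic_opt, out_dir):
    # Restore both optimizers when resuming a run.
    policy_opt.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
    critic_opt.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))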
[2025-08-20 16:20:47,680][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:20:47,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:20:47,687][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:20:50,139][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:20:50,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:20:50,147][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:20:50,149][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:20:50,150][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:20:50,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more "Accumulated the policy gradient loss for 8960 tokens." entries, logged ~0.8 s apart ...]
[2025-08-20 16:21:15,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:21:17,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:21:18,217][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:21:18,219][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:21:19,555][__main__][INFO] - Iteration 526 took 54s (37.30% Gen, 62.70% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 1m 4s. Estimated total time: 15h 11m 53s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 11s, 500 more iterations: 7h 35m 56s.
[2025-08-20 16:21:19,556][__main__][INFO] - Starting iteration 526.
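The per-iteration summary is straightforward bookkeeping: the Gen/Train split is each phase's share of the measured wall time (20 s / 54 s ≈ 37% generation, 34 s / 54 s ≈ 63% training), and the remaining-time and N-more-iterations figures multiply an average iteration duration by the relevant count, which is why the "Estimated total time" drifts by a few minutes as recent iterations re-weight the average. A sketch of that arithmetic with illustrative names:

def iteration_report(gen_s, train_s, iters_done, iters_total):
    # Phase split as a share of this iteration's measured wall time.
    wall = gen_s + train_s
    gen_pct = 100.0 * gen_s / wall
    # Extrapolate remaining time from the iteration duration
    # (a running mean over completed iterations in practice).
    remaining_s = (iters_total - iters_done) * wall
    return gen_pct, 100.0 - gen_pct, remaining_s

# e.g. with a hypothetical 1000-iteration budget:
# iteration_report(20, 34, 526, 1000) -> (~37.0, ~63.0, ~25600 s)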
[2025-08-20 16:21:42,857][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:21:42,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:21:42,865][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:21:45,296][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:21:45,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:21:45,304][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:21:45,306][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:21:45,306][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:21:45,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more "Accumulated the policy gradient loss for 8960 tokens." entries, logged ~0.8 s apart ...]
[2025-08-20 16:22:10,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:22:12,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:22:13,251][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:22:13,253][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:22:14,547][__main__][INFO] - Iteration 527 took 54s (37.94% Gen, 62.05% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 4m 46s. Estimated total time: 15h 16m 30s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 39s, 500 more iterations: 7h 38m 15s.
[2025-08-20 16:22:14,548][__main__][INFO] - Starting iteration 527.
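"Sharing advantage alignment data." / "Receiving advantage packets." mark the point where the two adapter trainers exchange their advantage estimates, so that each policy's surrogate loss can weight its own log-probabilities by the opponent's advantages as well as its own, which is the core idea of advantage-alignment objectives. A rough sketch of such an exchange over torch.distributed; the packet layout, peer-rank wiring, and surrogate form are all assumptions, not the actual mllm protocol:

import torch
import torch.distributed as dist

def exchange_advantages(my_adv, peer_rank):
    # Ship our advantage tensor to the peer trainer and receive theirs.
    other_adv = torch.empty_like(my_adv)
    pending = dist.isend(my_adv, dst=peer_rank)
    dist.recv(other_adv, src=peer_rank)
    pending.wait()
    return other_adv

def aligned_weights(my_adv, other_adv, beta=1.0):
    # Own-advantage term plus an opponent-shaping term; sign and scaling
    # conventions differ across advantage-alignment variants.
    return my_adv + beta * my_adv * other_adv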
[2025-08-20 16:22:37,484][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:22:37,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:22:37,492][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:22:39,932][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:22:39,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:22:39,939][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:22:39,941][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:22:39,942][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:22:40,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more "Accumulated the policy gradient loss for 8960 tokens." entries, logged ~0.8 s apart ...]
[2025-08-20 16:23:05,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:23:06,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:23:07,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:23:07,909][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:23:09,176][__main__][INFO] - Iteration 528 took 54s (37.52% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 57m 49s. Estimated total time: 15h 10m 27s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 2s, 500 more iterations: 7h 35m 13s.
[2025-08-20 16:23:09,177][__main__][INFO] - Starting iteration 528.
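"Get advantages with critic gradient accumulation" suggests a single pass that both returns advantage estimates and back-propagates the critic's value-regression loss, leaving the critic's gradients accumulated for a later update. A compact sketch using one-step TD errors; the loss form and discount factor are assumptions:

import torch
import torch.nn.functional as F

def advantages_with_critic_grads(values, rewards, gamma=0.99):
    # values: (T+1,) critic outputs that require grad; rewards: (T,).
    targets = rewards + gamma * values[1:].detach()
    advantages = (targets - values[:-1]).detach()
    # Backward here accumulates critic gradients without stepping yet.
    critic_loss = F.mse_loss(values[:-1], targets)
    critic_loss.backward()
    return advantages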
[2025-08-20 16:23:32,448][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:23:32,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:23:32,456][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:23:34,904][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:23:34,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:23:34,912][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:23:34,914][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:23:34,915][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:23:35,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more "Accumulated the policy gradient loss for 8960 tokens." entries, logged ~0.8 s apart ...]
[2025-08-20 16:24:00,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:24:02,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:24:02,923][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:24:02,925][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:24:04,332][__main__][INFO] - Iteration 529 took 55s (37.76% Gen, 62.24% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 5m 40s. Estimated total time: 15h 19m 13s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 36s.
[2025-08-20 16:24:04,333][__main__][INFO] - Starting iteration 529.
[2025-08-20 16:24:27,491][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:24:27,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:24:27,499][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:24:29,939][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:24:29,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:24:29,947][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:24:29,949][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:24:29,950][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:24:30,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more "Accumulated the policy gradient loss for 8960 tokens." entries, logged ~0.8 s apart ...]
[2025-08-20 16:24:55,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:24:56,910][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:24:57,846][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:24:57,847][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:24:59,192][__main__][INFO] - Iteration 530 took 54s (37.75% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 59m 50s. Estimated total time: 15h 14m 18s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 25s, 500 more iterations: 7h 37m 9s.
[2025-08-20 16:24:59,194][__main__][INFO] - Starting iteration 530.
[2025-08-20 16:25:22,092][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:25:22,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:25:22,099][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:25:24,567][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:25:24,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:25:24,575][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:25:24,578][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:25:24,578][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:25:24,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more "Accumulated the policy gradient loss for 8960 tokens." entries, logged ~0.8 s apart ...]
[2025-08-20 16:25:50,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:25:51,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:25:52,530][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:25:52,531][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:25:53,885][__main__][INFO] - Iteration 531 took 54s (37.38% Gen, 62.62% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 56m 8s. Estimated total time: 15h 11m 30s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 9s, 500 more iterations: 7h 35m 45s.
[2025-08-20 16:25:53,887][__main__][INFO] - Starting iteration 531.
[2025-08-20 16:26:16,724][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:26:16,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:26:16,732][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:26:19,169][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:26:19,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:26:19,177][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:26:19,180][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:26:19,181][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:26:19,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (entry repeated 32 times, ~0.8 s apart, through 16:26:44,617)
[2025-08-20 16:26:46,221][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:26:47,150][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:26:47,151][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:26:48,508][__main__][INFO] - Iteration 532 took 54s (37.36% Gen, 62.64% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 54m 3s. Estimated total time: 15h 10m 21s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 2s, 500 more iterations: 7h 35m 10s.
[2025-08-20 16:26:48,510][__main__][INFO] - Starting iteration 532.
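Each iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens" entries before a single "Apply reinforce step": the classic gradient-accumulation pattern with micro-batches packed to a fixed token budget, one backward pass each, and one optimizer step at the end. A minimal sketch under those assumptions; policy.log_probs and batch.advantages are invented interfaces, not the repo's API:

TOKEN_BUDGET = 8960   # tokens per micro-batch, from the log
ACCUM_STEPS = 32      # micro-batches per optimizer step, from the log

def apply_reinforce_step(policy, optimizer, micro_batches, logger):
    optimizer.zero_grad(set_to_none=True)
    for batch in micro_batches:  # len(micro_batches) == ACCUM_STEPS
        logp = policy.log_probs(batch.tokens)  # per-token log-probs (assumed)
        # REINFORCE-style loss: advantage-weighted negative log-likelihood,
        # normalized by the total token count so gradients match a full batch.
        loss = -(batch.advantages * logp).sum() / (TOKEN_BUDGET * ACCUM_STEPS)
        loss.backward()  # gradients accumulate across micro-batches
        logger.info(f"Accumulated the policy gradient loss for {batch.tokens.numel()} tokens.")
    optimizer.step()     # the single "Apply reinforce step"

The ~0.8 s spacing of the accumulation entries would then be the forward+backward cost of one 8960-token micro-batch through the adapter-wrapped 7B model.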
[2025-08-20 16:27:11,378][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:27:11,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:27:11,385][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:27:13,834][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:27:13,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:27:13,841][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:27:13,844][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:27:13,844][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:27:14,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (entry repeated 32 times, ~0.8 s apart, through 16:27:39,235)
[2025-08-20 16:27:40,821][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:27:41,729][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:27:41,731][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:27:43,102][__main__][INFO] - Iteration 533 took 54s (37.42% Gen, 62.58% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 52m 40s. Estimated total time: 15h 9m 52s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 59s, 500 more iterations: 7h 34m 56s.
[2025-08-20 16:27:43,104][__main__][INFO] - Starting iteration 533.
[2025-08-20 16:28:06,229][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:28:06,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:28:06,236][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:28:08,683][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:28:08,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:28:08,691][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:28:08,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:28:08,694][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:28:08,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (entry repeated 32 times, ~0.8 s apart, through 16:28:34,150)
[2025-08-20 16:28:35,735][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:28:36,670][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:28:36,671][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:28:37,890][__main__][INFO] - Iteration 534 took 54s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 54m 58s. Estimated total time: 15h 13m 5s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 18s, 500 more iterations: 7h 36m 32s.
[2025-08-20 16:28:37,892][__main__][INFO] - Starting iteration 534.
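After every step the trainer persists both optimizer states to the same two .pt files, overwriting them in place, so a preempted job can resume optimizer momentum without accumulating per-iteration checkpoints. A sketch of that save path, assuming plain torch.save of state_dicts; the function name is hypothetical:

import os
import torch

def save_optimizer_states(policy_opt, critic_opt, trainer_dir, logger):
    # Overwrite-in-place checkpointing, mirroring the paths in the log
    # (the log calls the policy optimizer the "main" optimizer).
    for name, opt in (("policy", policy_opt), ("critic", critic_opt)):
        path = os.path.join(trainer_dir, f"{name}_optimizer_state.pt")
        torch.save(opt.state_dict(), path)
        logger.info(f"Saved {name} optimizer state to {path}")

The ~1 ms gap between the two save entries suggests the second torch.save hits the filesystem cache; the adapter weights themselves are presumably saved elsewhere.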
[2025-08-20 16:29:00,844][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:00,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:00,851][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:03,305][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:03,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:03,313][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:03,316][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:29:03,316][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:29:03,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (entry repeated 32 times, ~0.8 s apart, through 16:29:28,704)
[2025-08-20 16:29:30,306][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:29:31,221][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:29:31,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:29:32,600][__main__][INFO] - Iteration 535 took 54s (37.47% Gen, 62.53% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 52m 46s. Estimated total time: 15h 11m 48s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 54s.
[2025-08-20 16:29:32,602][__main__][INFO] - Starting iteration 535.
[2025-08-20 16:29:55,428][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:55,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:55,436][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:57,903][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:57,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:57,911][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:29:57,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:29:57,914][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:29:58,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (entry repeated 32 times, ~0.8 s apart, through 16:30:23,317)
[2025-08-20 16:30:24,896][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:30:25,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:30:25,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:30:27,226][__main__][INFO] - Iteration 536 took 54s (37.28% Gen, 62.72% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 50m 28s. Estimated total time: 15h 10m 24s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 2s, 500 more iterations: 7h 35m 12s.
[2025-08-20 16:30:27,228][__main__][INFO] - Starting iteration 536.
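The per-iteration summary's time estimates are consistent with simple running-average extrapolation: mean iteration time multiplied by the remaining count, plus fixed-multiple projections for 10/100/500 more iterations. A sketch of that arithmetic, assuming a plain mean; the real estimator may smooth or window differently:

def eta_report(iter_times, total_iters):
    # Extrapolate from the average duration of completed iterations.
    done = len(iter_times)
    avg = sum(iter_times) / done
    remaining = avg * (total_iters - done)            # "Estimated remaining time"
    total = avg * total_iters                         # "Estimated total time"
    projections = {n: avg * n for n in (10, 100, 500)}  # "10/100/500 more iterations"
    return remaining, total, projections

For example, the printed 10-iteration estimate of 9m 6s implies an average of about 54.6 s per iteration; with roughly 450 iterations left that gives about 6h 50m, in line with the remaining-time estimate printed for iteration 536.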
[2025-08-20 16:30:50,122][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:30:50,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:30:50,130][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:30:52,591][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:30:52,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:30:52,598][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:30:52,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:30:52,601][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:30:52,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (entry repeated 32 times, ~0.8 s apart, through 16:31:18,054)
[2025-08-20 16:31:19,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:31:20,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:31:20,535][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:31:22,006][__main__][INFO] - Iteration 537 took 54s (37.32% Gen, 62.68% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 52m 6s. Estimated total time: 15h 12m 57s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 17s, 500 more iterations: 7h 36m 28s.
[2025-08-20 16:31:22,007][__main__][INFO] - Starting iteration 537.
[2025-08-20 16:31:44,999][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:31:45,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:31:45,007][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:31:47,461][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:31:47,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:31:47,469][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:31:47,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:31:47,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:31:47,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (entry repeated 32 times, ~0.8 s apart, through 16:32:12,890)
[2025-08-20 16:32:14,522][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:32:15,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:32:15,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:32:16,740][__main__][INFO] - Iteration 538 took 54s (37.56% Gen, 62.44% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 50m 26s. Estimated total time: 15h 12m 12s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 13s, 500 more iterations: 7h 36m 6s.
[2025-08-20 16:32:16,742][__main__][INFO] - Starting iteration 538.
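The paired "Sharing advantage alignment data" / "Receiving advantage packets" entries indicate that the two co-trained adapters exchange advantage estimates each iteration, which is what an advantage-alignment objective requires: each agent weights its policy gradient using the other agent's advantages as well as its own. A minimal point-to-point exchange sketch with torch.distributed, assuming one trainer process per agent; the function name and tensor shapes are hypothetical, and the actual transport in this run may differ:

import torch
import torch.distributed as dist

def exchange_advantages(my_adv: torch.Tensor, peer_rank: int) -> torch.Tensor:
    # Non-blocking send of our advantage estimates, blocking receive of
    # the peer's; symmetric on both ranks, so neither side deadlocks.
    their_adv = torch.empty_like(my_adv)
    pending = dist.isend(my_adv, dst=peer_rank)
    dist.recv(their_adv, src=peer_rank)
    pending.wait()
    return their_adv

The sub-millisecond gap between the share and receive entries suggests the packets are small (per-step scalar advantages rather than activations), so the exchange is negligible next to the 34 s training phase.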
[2025-08-20 16:32:40,033][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:32:40,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:32:40,041][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:32:42,505][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:32:42,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:32:42,513][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:32:42,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:32:42,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:32:42,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (entry repeated 32 times, ~0.8 s apart, through 16:33:07,913)
[2025-08-20 16:33:09,481][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:33:10,502][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:33:10,505][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:33:11,930][__main__][INFO] - Iteration 539 took 55s (37.73% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 57m 6s. Estimated total time: 15h 19m 47s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 53s.
[2025-08-20 16:33:11,932][__main__][INFO] - Starting iteration 539.
[2025-08-20 16:33:34,998][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:33:35,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:33:35,006][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:33:37,484][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:33:37,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:33:37,492][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:33:37,676][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:33:37,677][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:33:37,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (entry repeated 32 times through 16:34:07,944, including a ~5 s stall between 16:33:50,043 and 16:33:54,937 that accounts for this iteration's longer training time)
[2025-08-20 16:34:09,515][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:31, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:34:10,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:34:10,426][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:34:11,714][__main__][INFO] - Iteration 540 took 59s (34.43% Gen, 65.56% Train). Generation: 20s, Training: 39s. Estimated remaining time: 8h 12m 41s. Estimated total time: 16h 36m 22s. Time estimates for 10 more iterations: 9m 57s, 100 more iterations: 1h 39m 38s, 500 more iterations: 8h 18m 11s.
[2025-08-20 16:34:11,716][__main__][INFO] - Starting iteration 540.
[2025-08-20 16:34:35,141][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:34:35,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:34:35,148][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:34:37,614][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:34:37,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:34:37,622][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:34:37,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:34:37,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:34:37,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:38,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:39,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:40,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:41,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:41,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:42,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:43,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:44,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:45,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:45,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:46,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:47,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:48,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:49,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:49,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:50,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:51,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:52,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:53,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:54,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:55,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:55,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:56,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:57,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:58,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:59,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:34:59,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:00,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:01,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:02,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:02,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:04,589][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:35:05,511][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:35:05,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:35:06,840][__main__][INFO] - Iteration 541 took 55s (38.03% Gen, 61.97% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 54m 8s. Estimated total time: 15h 18m 43s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 21s.
[2025-08-20 16:35:06,842][__main__][INFO] - Starting iteration 541.
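Each iteration logs 32 "Accumulated the policy gradient loss for 8960 tokens" entries (~0.8 s apart) followed by a single "Apply reinforce step": a standard gradient-accumulation loop that calls backward() once per fixed-size token microbatch and takes one optimizer step at the end. A minimal sketch under that assumption (the batch fields, the callable, and the per-token normalization are illustrative, not the mllm API):

import torch

def reinforce_update(policy_logprob_fn, optimizer, microbatches):
    """Accumulate per-microbatch policy-gradient losses, then step once.

    policy_logprob_fn(mb) -> per-token log pi(a_t | s_t) for the action
    tokens in the microbatch (shape [num_tokens]); mb["advantages"] holds
    the matching per-token advantages.
    """
    optimizer.zero_grad(set_to_none=True)
    total_tokens = sum(mb["advantages"].numel() for mb in microbatches)
    for mb in microbatches:  # e.g. 32 microbatches of 8960 tokens each
        logprobs = policy_logprob_fn(mb)
        # REINFORCE surrogate, normalized by the full batch's token count
        # so the accumulated gradient matches one large-batch gradient.
        loss = -(mb["advantages"] * logprobs).sum() / total_tokens
        loss.backward()
        print(f'Accumulated the policy gradient loss for {mb["advantages"].numel()} tokens.')
    optimizer.step()  # the single "Apply reinforce step"

The tiny ΔVRAM on the step (~2.3e-05 GB) is consistent with this pattern: activations are freed after each microbatch backward, so the optimizer step itself allocates almost nothing new.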
[2025-08-20 16:35:30,488][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:35:30,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:35:30,496][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:35:32,964][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:35:32,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:35:32,971][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:35:32,974][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:35:32,974][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:35:33,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:34,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:34,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:35,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:36,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:37,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:38,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:38,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:39,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:40,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:41,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:42,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:42,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:43,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:44,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:45,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:46,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:47,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:47,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:48,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:49,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:50,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:51,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:51,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:52,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:53,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:54,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:55,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:55,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:56,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:57,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:58,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:35:59,932][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:36:00,877][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:36:00,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:36:02,339][__main__][INFO] - Iteration 542 took 55s (38.15% Gen, 61.84% Train). Generation: 21s, Training: 34s. Estimated remaining time: 6h 59m 25s. Estimated total time: 15h 24m 57s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 29s, 500 more iterations: 7h 42m 28s.
[2025-08-20 16:36:02,341][__main__][INFO] - Starting iteration 542.
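For reference, the estimator such a reinforce step implements is the standard score-function (REINFORCE) gradient with a learned baseline; the critic named in the "Get advantages with critic gradient accumulation" entries would supply that baseline. In generic notation (not taken from the mllm code):

    \nabla_\theta J(\theta)
      = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\Big],
    \qquad
    A_t = R_t - V_\phi(s_t),
    \quad
    R_t = \sum_{k \ge t} \gamma^{\,k-t} r_k.

Subtracting the baseline V_\phi leaves the gradient unbiased while reducing its variance, which is why each iteration pairs advantage estimation with a critic update.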
[2025-08-20 16:36:25,401][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:36:25,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:36:25,409][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:36:27,894][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:36:27,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:36:27,901][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:36:27,904][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:36:27,904][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:36:28,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:28,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:29,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:30,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:31,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:32,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:32,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:33,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:34,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:35,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:36,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:36,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:37,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:38,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:39,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:40,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:40,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:41,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:43,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:43,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:44,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:45,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:46,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:47,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:47,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:48,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:49,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:50,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:51,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:51,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:52,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:53,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:36:55,007][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:36:55,938][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:36:55,939][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:36:57,313][__main__][INFO] - Iteration 543 took 54s (37.46% Gen, 62.54% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 49m 45s. Estimated total time: 15h 16m 12s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 6s.
[2025-08-20 16:36:57,315][__main__][INFO] - Starting iteration 543.
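"Get advantages with critic gradient accumulation" suggests the value head's regression loss is back-propagated in the same pass that produces the advantages consumed above, with the critic's optimizer step taken separately. A rough sketch under that assumption, using Monte Carlo returns as the regression target (batch layout, shapes, and the discount value are assumptions):

import torch
import torch.nn.functional as F

def advantages_with_critic_accumulation(critic, states, rewards, gamma=0.99):
    """Return per-step advantages; accumulate the critic's gradient.

    states: [B, T, ...] inputs to the value head; rewards: [B, T].
    Calling this repeatedly accumulates the value-loss gradient across
    batches until the critic optimizer steps.
    """
    values = critic(states).squeeze(-1)             # V_phi(s_t), shape [B, T]
    returns = torch.empty_like(rewards)
    running = torch.zeros(rewards.shape[0])
    for t in reversed(range(rewards.shape[1])):     # discounted returns R_t
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    F.mse_loss(values, returns).backward()          # critic gradient accumulates
    return (returns - values).detach()              # A_t = R_t - V_phi(s_t)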
[2025-08-20 16:37:24,698][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:37:24,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:37:24,706][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:37:27,182][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:37:27,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:37:27,190][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:37:27,193][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:37:27,193][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:37:27,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:28,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:29,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:29,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:30,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:31,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:32,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:33,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:33,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:34,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:35,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:36,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:37,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:37,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:38,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:39,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:40,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:40,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:41,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:42,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:43,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:44,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:44,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:45,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:46,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:47,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:48,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:49,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:50,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:50,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:51,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:52,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:37:54,157][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:37:55,146][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:37:55,148][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:37:56,522][__main__][INFO] - Iteration 544 took 59s (42.10% Gen, 57.90% Train). Generation: 24s, Training: 34s. Estimated remaining time: 7h 59m 21s. Estimated total time: 16h 26m 46s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 40s, 500 more iterations: 8h 13m 23s.
[2025-08-20 16:37:56,524][__main__][INFO] - Starting iteration 544.
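The "Sharing advantage alignment data" / "Receiving advantage packets" pair indicates that the two players' trainers exchange advantage estimates each iteration, so each policy update can be conditioned on its opponent's advantages; that exchange is what distinguishes advantage-alignment-style opponent shaping from plain REINFORCE. A minimal sketch of such an exchange over in-process queues; the packet layout and transport are assumptions, not the mllm protocol:

import queue
from dataclasses import dataclass

import torch

@dataclass
class AdvantagePacket:
    episode_ids: list         # which trajectories these advantages belong to
    advantages: torch.Tensor  # per-step advantage estimates (kept on CPU)

def exchange_advantages(own, outbox, inbox, timeout=60.0):
    """Send our packet, then block until the opponent's packet arrives."""
    outbox.put(own)                      # "Sharing advantage alignment data."
    other = inbox.get(timeout=timeout)   # "Receiving advantage packets."
    if other.episode_ids != own.episode_ids:
        raise RuntimeError("trainers disagree on the trajectory batch")
    return other

The sub-millisecond gap between the two log lines is consistent with the opponent's packet already being queued when the receive is issued.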
[2025-08-20 16:38:19,525][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:38:19,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:38:19,533][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:38:21,983][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:38:21,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:38:21,991][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:38:21,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:38:21,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:38:22,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:23,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:23,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:24,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:25,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:26,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:27,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:27,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:28,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:29,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:30,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:31,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:31,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:32,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:33,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:34,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:35,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:35,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:36,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:37,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:38,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:39,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:40,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:41,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:41,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:42,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:43,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:44,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:44,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:45,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:46,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:47,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:38:48,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:38:49,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:38:49,866][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:38:51,235][__main__][INFO] - Iteration 545 took 54s (37.58% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 43m 31s. Estimated total time: 15h 11m 51s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 11s, 500 more iterations: 7h 35m 55s.
[2025-08-20 16:38:51,237][__main__][INFO] - Starting iteration 545.
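Each iteration also logs two rounds of "Create alternative trajectory batch" followed by "Compute alternative advantage estimates", suggesting that counterfactual trajectories are scored to estimate what the advantages would have been under actions other than the ones actually taken. A purely illustrative sketch of that idea for a two-action matrix game such as the iterated prisoner's dilemma (none of these names come from the log):

import copy

def flip(action):
    """The other IPD move."""
    return "defect" if action == "cooperate" else "cooperate"

def make_alternative_batch(batch):
    """Clone a trajectory batch with one agent's actions replaced.

    Scoring the clone with the same critic yields 'alternative advantage
    estimates': the advantage the agent would have received had it played
    the other move at each step.
    """
    alt = copy.deepcopy(batch)
    alt["actions"] = [flip(a) for a in batch["actions"]]
    return alt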
[2025-08-20 16:39:14,141][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:39:14,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:39:14,149][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:39:16,603][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:39:16,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:39:16,610][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:39:16,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:39:16,613][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:39:16,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:17,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:18,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:19,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:20,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:20,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:21,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:22,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:23,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:24,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:24,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:25,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:26,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:27,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:28,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:28,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:29,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:30,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:31,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:32,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:33,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:34,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:34,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:35,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:36,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:37,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:38,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:38,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:39,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:40,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:41,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:42,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:39:43,625][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:39:44,708][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:39:44,711][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:39:46,015][__main__][INFO] - Iteration 546 took 54s (37.37% Gen, 62.63% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 43m 42s. Estimated total time: 15h 12m 57s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 17s, 500 more iterations: 7h 36m 28s.
[2025-08-20 16:39:46,017][__main__][INFO] - Starting iteration 546.
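Both optimizer states are written out after every single step, which makes the run resumable at iteration granularity at the cost of two small writes per iteration. The mechanics are ordinary torch.save/torch.load of optimizer state dicts; a sketch (directory argument shortened, trainer wiring omitted):

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir):
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

def load_optimizer_states(policy_opt, critic_opt, out_dir):
    # On resume, restore momentum/variance buffers so training continues
    # as if uninterrupted.
    policy_opt.load_state_dict(
        torch.load(f"{out_dir}/policy_optimizer_state.pt", map_location="cpu"))
    critic_opt.load_state_dict(
        torch.load(f"{out_dir}/critic_optimizer_state.pt", map_location="cpu"))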
[2025-08-20 16:40:08,992][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:40:08,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:40:08,999][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:40:11,464][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:40:11,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:40:11,471][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:40:11,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:40:11,474][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:40:11,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:12,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:13,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:14,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:14,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:15,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:16,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:17,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:18,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:18,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:19,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:20,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:21,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:22,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:22,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:23,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:24,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:25,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:26,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:26,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:27,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:28,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:29,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:30,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:31,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:32,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:32,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:33,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:34,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:35,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:36,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:36,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:40:38,466][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:40:39,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:40:39,396][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:40:40,705][__main__][INFO] - Iteration 547 took 54s (37.52% Gen, 62.48% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 41m 18s. Estimated total time: 15h 11m 28s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 8s, 500 more iterations: 7h 35m 44s.
[2025-08-20 16:40:40,707][__main__][INFO] - Starting iteration 547.
[2025-08-20 16:41:03,818][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:41:03,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:41:03,826][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:41:06,257][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:41:06,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:41:06,264][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:41:06,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:41:06,267][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:41:06,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:07,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:08,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:08,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:09,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:10,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:11,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:12,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:12,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:13,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:14,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:15,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:16,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:16,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:17,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:18,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:19,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:20,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:21,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:22,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:22,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:23,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:24,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:25,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:26,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:26,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:27,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:28,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:29,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:30,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:30,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:31,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:41:33,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:41:34,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:41:34,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:41:35,605][__main__][INFO] - Iteration 548 took 54s (37.69% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 43m 52s. Estimated total time: 15h 14m 57s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 28s.
[2025-08-20 16:41:35,607][__main__][INFO] - Starting iteration 548.
[2025-08-20 16:41:59,046][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:41:59,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:41:59,054][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:42:01,500][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:42:01,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:42:01,508][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:42:01,510][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:42:01,510][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:42:01,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:02,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:03,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:04,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:04,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:05,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:06,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:07,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:08,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:08,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:09,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:10,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:11,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:12,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:12,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:13,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:14,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:15,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:16,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:16,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:17,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:18,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:19,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:20,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:20,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:21,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:22,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:23,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:24,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:25,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:26,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:26,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:42:28,562][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:42:30,182][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:42:30,184][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:42:31,623][__main__][INFO] - Iteration 549 took 56s (37.47% Gen, 62.53% Train). Generation: 20s, Training: 35s. Estimated remaining time: 7h 1m 35s. Estimated total time: 15h 33m 36s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 21s, 500 more iterations: 7h 46m 48s.
[2025-08-20 16:42:31,625][__main__][INFO] - Starting iteration 549.
[2025-08-20 16:42:55,016][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:42:55,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:42:55,024][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:42:57,473][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:42:57,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:42:57,481][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:42:57,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:42:57,484][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:42:57,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more identical "Accumulated the policy gradient loss for 8960 tokens." entries, ~0.8 s apart ...]
[2025-08-20 16:43:22,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:43:24,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 16:43:25,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:43:25,459][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:43:26,842][__main__][INFO] - Iteration 550 took 55s (37.93% Gen, 62.06% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 47m 21s. Estimated total time: 15h 20m 16s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 1s, 500 more iterations: 7h 40m 8s.
[2025-08-20 16:43:26,844][__main__][INFO] - Starting iteration 550.
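Each reinforce step is preceded by 32 "Accumulated the policy gradient loss for 8960 tokens" entries, i.e. the policy-gradient loss is accumulated over 32 micro-batches (32 × 8960 = 286,720 tokens) before a single optimizer update. A sketch of that accumulation pattern, assuming a standard PyTorch loop; the policy/batch interfaces here are illustrative, not the actual trainer_common code:

    import logging

    logger = logging.getLogger("mllm.training.trainer_common")

    def reinforce_update(policy, micro_batches, optimizer):
        """Accumulate the REINFORCE loss over micro-batches, then step (illustrative)."""
        optimizer.zero_grad(set_to_none=True)
        n = len(micro_batches)  # 32 micro-batches per update in the log above
        for batch in micro_batches:
            logp = policy.log_probs(batch.tokens)        # per-token log-probabilities
            loss = -(batch.advantages * logp).sum() / batch.num_tokens
            (loss / n).backward()                        # scale so gradients average
            logger.info(
                "Accumulated the policy gradient loss for %d tokens.",
                batch.num_tokens,                        # 8960 tokens per micro-batch
            )
        optimizer.step()                                 # the "Apply reinforce step" task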
[2025-08-20 16:43:49,835][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:43:49,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:43:49,842][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:43:52,293][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:43:52,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:43:52,301][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:43:52,304][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:43:52,304][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:43:52,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more identical "Accumulated the policy gradient loss for 8960 tokens." entries, ~0.8 s apart ...]
[2025-08-20 16:44:17,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:44:19,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 16:44:20,316][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:44:20,318][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:44:24,365][__main__][INFO] - Iteration 551 took 57s (35.68% Gen, 59.72% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7h 24m 48s. Estimated total time: 15h 58m 41s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 52s, 500 more iterations: 7h 59m 20s.
[2025-08-20 16:44:24,367][__main__][INFO] - Starting iteration 551.
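After every update, both optimizer states are rewritten to fixed paths under agent_trainer/, which is what keeps the run resumable at a cost of roughly one second per iteration. A sketch of that save step, assuming a plain torch.save of the optimizers' state_dicts (the filenames are copied from the log; the helper itself is hypothetical):

    import os

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
        """Persist both optimizer states after each reinforce step (illustrative)."""
        os.makedirs(out_dir, exist_ok=True)
        torch.save(policy_opt.state_dict(),
                   os.path.join(out_dir, "policy_optimizer_state.pt"))
        torch.save(critic_opt.state_dict(),
                   os.path.join(out_dir, "critic_optimizer_state.pt"))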
[2025-08-20 16:44:47,492][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:44:47,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:44:47,499][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:44:49,952][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:44:49,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:44:49,960][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:44:49,963][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:44:49,963][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:44:50,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more identical "Accumulated the policy gradient loss for 8960 tokens." entries, ~0.8 s apart ...]
[2025-08-20 16:45:15,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:45:16,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 16:45:17,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:45:17,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:45:19,241][__main__][INFO] - Iteration 552 took 54s (37.68% Gen, 62.32% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 39m 45s. Estimated total time: 15h 14m 33s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 16s.
[2025-08-20 16:45:19,243][__main__][INFO] - Starting iteration 552.
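The end-of-iteration summaries are consistent with an estimator that extrapolates the latest iteration's (unrounded) duration over the remaining iterations: for iteration 550, 500 × ~55.2 s ≈ 7h 40m, matching the logged "500 more iterations: 7h 40m 8s". A sketch under that assumption (the real estimator may smooth over several iterations):

    def format_hms(seconds: float) -> str:
        """Render a duration like the log does, e.g. '7h 40m 8s' or '9m 12s'."""
        s = int(seconds)
        h, m = s // 3600, s % 3600 // 60
        return (f"{h}h " if h else "") + f"{m}m {s % 60}s"

    def iteration_summary(i: int, total_iters: int,
                          gen_s: float, train_s: float, wall_s: float) -> str:
        """Build the per-iteration timing line (illustrative reconstruction)."""
        line = (
            f"Iteration {i} took {int(wall_s)}s "
            f"({100 * gen_s / wall_s:.2f}% Gen, {100 * train_s / wall_s:.2f}% Train). "
            f"Generation: {int(gen_s)}s, Training: {int(train_s)}s. "
            f"Estimated remaining time: {format_hms((total_iters - i) * wall_s)}. "
            f"Estimated total time: {format_hms(total_iters * wall_s)}."
        )
        extras = ", ".join(
            f"{k} more iterations: {format_hms(k * wall_s)}" for k in (10, 100, 500)
        )
        return f"{line} Time estimates for {extras}."

Note this also explains iteration 551's odd "(35.68% Gen, 59.72% Train)": the percentages are taken against total wall time, and that iteration carried a few seconds of overhead outside generation and training, so they need not sum to 100%.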
[2025-08-20 16:45:42,489][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:45:42,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:45:42,496][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:45:44,963][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:45:44,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:45:44,970][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:45:44,973][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:45:44,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:45:45,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more identical "Accumulated the policy gradient loss for 8960 tokens." entries, ~0.8 s apart ...]
[2025-08-20 16:46:10,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:46:11,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 16:46:12,903][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:46:12,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:46:14,305][__main__][INFO] - Iteration 553 took 55s (37.73% Gen, 62.26% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 41m 58s. Estimated total time: 15h 17m 41s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 50s.
[2025-08-20 16:46:14,307][__main__][INFO] - Starting iteration 553.
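The "Sharing advantage alignment data." / "Receiving advantage packets." pair suggests the two co-trained agents swap advantage estimates each iteration, so that each policy's gradient can be weighted by its opponent's advantages, which is the core idea of advantage alignment. A guess at the mechanism using multiprocessing queues; the actual transport mllm uses is not visible in this log:

    import logging
    from multiprocessing import Queue

    logger = logging.getLogger("mllm.training.trainer_ad_align")

    def exchange_advantages(own_advantages, send_q: Queue, recv_q: Queue):
        """Swap advantage tensors with the co-trained agent (speculative sketch)."""
        logger.info("Sharing advantage alignment data.")
        send_q.put(own_advantages.cpu())   # hand our estimates to the peer process
        logger.info("Receiving advantage packets.")
        return recv_q.get()                # block until the peer's estimates arrive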
[2025-08-20 16:46:37,434][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:46:37,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:46:37,442][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:46:39,909][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:46:39,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:46:39,917][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:46:39,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:46:39,920][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:46:40,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more identical "Accumulated the policy gradient loss for 8960 tokens." entries, ~0.8 s apart ...]
[2025-08-20 16:47:05,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:47:06,847][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 16:47:07,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:47:07,823][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:47:09,153][__main__][INFO] - Iteration 554 took 54s (37.67% Gen, 62.33% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 37m 27s. Estimated total time: 15h 14m 5s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 2s.
[2025-08-20 16:47:09,154][__main__][INFO] - Starting iteration 554.
[2025-08-20 16:47:32,570][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:47:32,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:47:32,577][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:47:35,058][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:47:35,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:47:35,066][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:47:35,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:47:35,069][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:47:35,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more identical "Accumulated the policy gradient loss for 8960 tokens." entries, ~0.8 s apart ...]
[2025-08-20 16:48:00,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:48:02,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 16:48:03,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:48:03,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:48:04,319][__main__][INFO] - Iteration 555 took 55s (37.98% Gen, 62.02% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 41m 50s. Estimated total time: 15h 19m 24s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 56s, 500 more iterations: 7h 39m 42s.
[2025-08-20 16:48:04,320][__main__][INFO] - Starting iteration 555.
[2025-08-20 16:48:27,388][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:48:27,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:48:27,396][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:48:29,824][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:48:29,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:48:29,831][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:48:29,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:48:29,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:48:30,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more identical "Accumulated the policy gradient loss for 8960 tokens." entries, ~0.8 s apart ...]
[2025-08-20 16:48:55,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:48:56,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 16:48:57,834][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:48:57,836][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:48:59,394][__main__][INFO] - Iteration 556 took 55s (37.47% Gen, 62.53% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 39m 25s. Estimated total time: 15h 17m 53s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 56s.
[2025-08-20 16:48:59,396][__main__][INFO] - Starting iteration 556.
[2025-08-20 16:49:22,417][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:49:22,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:49:22,425][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:49:24,882][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:49:24,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:49:24,889][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:49:24,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:49:24,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:49:25,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more identical "Accumulated the policy gradient loss for 8960 tokens." entries, ~0.8 s apart ...]
[2025-08-20 16:49:50,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:49:51,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 16:49:52,837][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:49:52,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:49:54,176][__main__][INFO] - Iteration 557 took 54s (37.58% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 33m 36s. Estimated total time: 15h 13m 0s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 18s, 500 more iterations: 7h 36m 30s.
[2025-08-20 16:49:54,178][__main__][INFO] - Starting iteration 557.
[2025-08-20 16:50:17,205][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:50:17,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:50:17,213][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:50:19,663][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:50:19,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:50:19,671][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 16:50:19,673][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:50:19,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:50:19,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 more identical "Accumulated the policy gradient loss for 8960 tokens." entries, ~0.8 s apart ...]
[2025-08-20 16:50:45,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:50:46,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 16:50:47,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:50:47,685][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:50:49,211][__main__][INFO] - Iteration 558 took 55s (37.40% Gen, 62.60% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 36m 54s. Estimated total time: 15h 17m 12s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 36s.
[2025-08-20 16:50:49,213][__main__][INFO] - Starting iteration 558.
[2025-08-20 16:51:12,654][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:51:12,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:51:12,661][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:51:15,106][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:51:15,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:51:15,113][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:51:15,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:51:15,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:51:15,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:16,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:17,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:17,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:18,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:19,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:20,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:20,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:21,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:22,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:23,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:24,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:24,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:25,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:26,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:27,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:28,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:28,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:29,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:30,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:31,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:32,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:32,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:33,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:34,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:35,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:36,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:37,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:38,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:38,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:39,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:40,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:51:42,084][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:51:43,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:51:43,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:51:44,381][__main__][INFO] - Iteration 559 took 55s (38.05% Gen, 61.95% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 38m 14s. Estimated total time: 15h 19m 28s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 56s, 500 more iterations: 7h 39m 44s.
[2025-08-20 16:51:44,384][__main__][INFO] - Starting iteration 559.
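"Get advantages with critic gradient accumulation" suggests the trainer computes advantage targets from the critic's value predictions while accumulating the critic's regression loss in the same pass. A sketch under that assumption; `critic`, `states`, and `returns` are hypothetical stand-ins for the repository's value head and trajectory tensors:

```python
import torch
import torch.nn.functional as F

def advantages_with_critic_accumulation(critic: torch.nn.Module,
                                        states: torch.Tensor,
                                        returns: torch.Tensor,
                                        accum_steps: int) -> torch.Tensor:
    """One micro-batch: advantage estimates plus an accumulated critic loss."""
    values = critic(states).squeeze(-1)  # V(s_t) for every step in the batch
    # Regression loss, scaled so gradients summed over `accum_steps`
    # micro-batches behave like a full-batch mean.
    (F.mse_loss(values, returns) / accum_steps).backward()
    # Detached advantages become fixed targets for the policy update.
    return (returns - values).detach()
```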
[2025-08-20 16:52:07,483][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:52:07,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:52:07,491][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:52:09,943][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:52:09,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:52:09,951][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:52:09,954][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:52:09,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:52:10,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:11,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:11,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:12,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:13,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:14,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:15,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:15,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:16,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:17,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:18,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:18,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:19,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:20,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:21,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:22,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:22,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:23,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:24,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:25,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:26,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:26,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:27,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:28,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:29,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:30,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:31,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:32,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:32,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:33,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:34,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:35,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:52:36,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:52:37,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:52:37,898][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:52:39,531][__main__][INFO] - Iteration 560 took 55s (37.44% Gen, 62.56% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 36m 56s. Estimated total time: 15h 19m 5s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 32s.
[2025-08-20 16:52:39,533][__main__][INFO] - Starting iteration 560.
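The "Sharing advantage alignment data" / "Receiving advantage packets" pair suggests each trainer publishes its own advantage estimates and then blocks until its counterpart's packet arrives; opponent-shaping methods in the advantage-alignment family need both players' advantages before the policy loss can be formed. The transport is not visible in the log, so here is a minimal in-process sketch using `queue.Queue`, with all names hypothetical:

```python
import queue
from dataclasses import dataclass

import torch

@dataclass
class AdvantagePacket:
    """Hypothetical payload exchanged between the two trainers."""
    iteration: int
    advantages: torch.Tensor  # per-step advantage estimates for one batch

def exchange(own: AdvantagePacket,
             outbox: "queue.Queue[AdvantagePacket]",
             inbox: "queue.Queue[AdvantagePacket]",
             timeout: float = 60.0) -> AdvantagePacket:
    """Share our advantages, then wait for the peer's packet."""
    outbox.put(own)                    # "Sharing advantage alignment data."
    peer = inbox.get(timeout=timeout)  # "Receiving advantage packets."
    assert peer.iteration == own.iteration, "trainers out of sync"
    return peer
```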
[2025-08-20 16:53:02,733][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:53:02,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:53:02,740][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:53:05,185][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:53:05,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:53:05,194][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:53:05,196][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:53:05,196][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:53:05,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:06,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:07,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:07,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:08,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:09,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:10,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:11,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:11,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:12,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:13,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:14,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:15,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:15,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:16,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:17,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:18,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:18,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:19,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:20,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:21,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:22,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:23,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:24,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:25,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:25,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:26,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:27,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:28,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:29,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:29,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:30,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:53:32,257][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:53:33,304][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:53:33,308][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:53:34,673][__main__][INFO] - Iteration 561 took 55s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 35m 56s. Estimated total time: 15h 18m 59s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 53s, 500 more iterations: 7h 39m 29s.
[2025-08-20 16:53:34,676][__main__][INFO] - Starting iteration 561.
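Each update then accumulates the policy-gradient loss over 32 micro-batches of 8,960 action tokens apiece (about 286,720 tokens per optimizer step), one "Accumulated the policy gradient loss" line per backward pass. A sketch of a single pass; the tensor names and shapes are assumptions:

```python
import torch

def accumulate_policy_loss(logp: torch.Tensor,
                           advantages: torch.Tensor,
                           action_mask: torch.Tensor,
                           accum_steps: int) -> int:
    """One micro-batch of a REINFORCE-style surrogate (sketch).

    `logp` holds per-token log-probabilities of the sampled actions,
    `advantages` the per-token targets, and `action_mask` marks the 8960
    action tokens the log counts; all three shaped (batch, seq_len).
    """
    n_tokens = int(action_mask.sum())
    loss = -(advantages * logp * action_mask).sum() / n_tokens
    (loss / accum_steps).backward()  # gradients add up across the 32 passes
    return n_tokens
```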
[2025-08-20 16:53:58,332][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:53:58,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:53:58,340][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:00,781][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:00,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:00,788][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:00,790][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:54:00,791][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:54:01,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:01,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:02,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:03,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:04,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:05,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:05,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:06,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:07,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:08,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:09,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:09,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:10,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:11,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:12,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:12,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:13,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:14,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:15,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:16,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:16,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:18,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:18,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:19,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:20,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:21,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:22,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:22,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:23,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:24,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:25,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:26,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:27,714][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:54:28,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:54:28,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:54:30,353][__main__][INFO] - Iteration 562 took 55s (38.05% Gen, 61.94% Train). Generation: 21s, Training: 34s. Estimated remaining time: 6h 43m 54s. Estimated total time: 15h 27m 53s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 56s.
[2025-08-20 16:54:30,355][__main__][INFO] - Starting iteration 562.
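"Apply reinforce step" consumes those accumulated gradients in a single optimizer update. A minimal sketch; the gradient-clipping threshold is an assumption, since the log only confirms that one step follows the 32 accumulation passes:

```python
import torch

def apply_reinforce_step(optimizer: torch.optim.Optimizer,
                         params,
                         max_grad_norm: float = 1.0) -> None:
    torch.nn.utils.clip_grad_norm_(params, max_grad_norm)  # assumed threshold
    optimizer.step()                       # one update per iteration
    optimizer.zero_grad(set_to_none=True)  # reset for the next accumulation run
```

Note that the task's reported ΔTime of 26-27 s spans the whole measured region, i.e. the accumulation passes plus the step itself, which accounts for the bulk of the ~34 s training share in each iteration summary.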
[2025-08-20 16:54:53,353][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:53,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:53,361][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:55,809][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:55,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:55,817][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:54:55,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:54:55,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:54:56,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:56,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:57,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:58,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:54:59,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:00,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:00,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:01,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:02,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:03,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:04,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:04,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:05,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:06,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:07,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:08,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:08,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:09,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:10,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:11,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:12,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:13,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:14,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:14,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:15,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:16,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:17,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:17,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:18,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:19,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:20,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:21,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:22,748][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:55:23,756][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:55:23,758][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:55:25,648][__main__][INFO] - Iteration 563 took 55s (37.18% Gen, 62.82% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 36m 38s. Estimated total time: 15h 21m 33s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 9s, 500 more iterations: 7h 40m 46s.
[2025-08-20 16:55:25,650][__main__][INFO] - Starting iteration 563.
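Both optimizer states are rewritten under the run directory after every update, which is what makes a multi-hour run like this resumable. A sketch of the save and the matching restore, assuming standard `torch.save`/`state_dict` checkpointing (function names illustrative):

```python
import torch

def save_optimizer_states(policy_opt: torch.optim.Optimizer,
                          critic_opt: torch.optim.Optimizer,
                          trainer_dir: str) -> None:
    # The log shows both files rewritten once per iteration.
    torch.save(policy_opt.state_dict(), f"{trainer_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{trainer_dir}/critic_optimizer_state.pt")

def restore_optimizer_states(policy_opt: torch.optim.Optimizer,
                             critic_opt: torch.optim.Optimizer,
                             trainer_dir: str) -> None:
    # map_location avoids pinning the load to the GPU that saved the state.
    policy_opt.load_state_dict(
        torch.load(f"{trainer_dir}/policy_optimizer_state.pt", map_location="cpu"))
    critic_opt.load_state_dict(
        torch.load(f"{trainer_dir}/critic_optimizer_state.pt", map_location="cpu"))
```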
[2025-08-20 16:55:49,047][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:55:49,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:55:49,055][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:55:51,537][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:55:51,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:55:51,545][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:55:51,547][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:55:51,547][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:55:51,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:52,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:53,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:54,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:55,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:55,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:56,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:57,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:58,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:58,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:55:59,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:00,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:01,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:02,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:02,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:03,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:04,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:05,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:06,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:06,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:07,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:09,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:09,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:10,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:11,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:12,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:12,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:13,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:14,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:15,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:16,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:16,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:18,544][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:56:19,585][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:56:19,588][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:56:20,918][__main__][INFO] - Iteration 564 took 55s (37.86% Gen, 62.14% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 35m 17s. Estimated total time: 15h 21m 7s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 33s.
[2025-08-20 16:56:20,920][__main__][INFO] - Starting iteration 564.
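The per-iteration summaries are plain arithmetic on a recent seconds-per-iteration average. Assuming ~55.27 s per iteration and truncation to whole seconds, the three quoted estimates in iteration 564's summary line are reproduced exactly, and the ~15h 21m estimated total corresponds to a run of roughly 1,000 iterations:

```python
def fmt(seconds: float) -> str:
    h, rem = divmod(int(seconds), 3600)  # the logger appears to truncate
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

secs_per_iter = 55.266  # recent average implied by the summary line above
for n in (10, 100, 500):
    print(f"{n} more iterations: {fmt(n * secs_per_iter)}")
# -> 9m 12s, 1h 32m 6s, 7h 40m 33s, matching the log
```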
[2025-08-20 16:56:44,053][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:56:44,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:56:44,061][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:56:46,493][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:56:46,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:56:46,500][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:56:46,502][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:56:46,503][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:56:46,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:47,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:48,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:49,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:49,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:50,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:51,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:52,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:53,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:53,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:54,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:55,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:56,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:57,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:57,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:58,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:56:59,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:00,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:01,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:02,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:03,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:03,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:04,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:05,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:06,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:07,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:07,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:08,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:09,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:10,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:11,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:11,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:13,448][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:57:14,458][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:57:14,460][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:57:15,820][__main__][INFO] - Iteration 565 took 54s (37.70% Gen, 62.30% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 28m 14s. Estimated total time: 15h 14m 59s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 29s.
[2025-08-20 16:57:15,821][__main__][INFO] - Starting iteration 565.
[2025-08-20 16:57:39,101][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:57:39,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:57:39,109][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:57:41,566][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:57:41,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:57:41,574][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:57:41,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:57:41,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:57:41,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:42,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:43,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:44,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:45,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:45,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:46,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:47,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:48,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:49,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:49,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:50,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:51,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:52,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:52,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:53,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:54,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:55,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:56,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:56,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:57,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:58,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:57:59,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:00,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:01,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:02,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:02,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:03,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:04,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:05,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:06,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:06,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:08,518][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:58:09,679][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:58:09,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:58:11,152][__main__][INFO] - Iteration 566 took 55s (37.65% Gen, 62.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 34m 30s. Estimated total time: 15h 22m 10s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 5s.
[2025-08-20 16:58:11,154][__main__][INFO] - Starting iteration 566.
[2025-08-20 16:58:34,389][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:58:34,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:58:34,397][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:58:36,874][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:58:36,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:58:36,882][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 16:58:36,884][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 16:58:36,885][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 16:58:37,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:37,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:38,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:39,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:40,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:41,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:41,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:42,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:43,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:44,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:45,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:45,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:46,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:47,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:48,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:49,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:49,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:51,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:51,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:52,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:53,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:54,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:55,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:55,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:56,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:57,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:58,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:59,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:58:59,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:59:00,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:59:01,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:59:02,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 16:59:03,882][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 16:59:04,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 16:59:04,847][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 16:59:06,133][__main__][INFO] - Iteration 567 took 54s (37.78% Gen, 62.22% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 27m 43s. Estimated total time: 15h 16m 19s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 9s.
[2025-08-20 16:59:06,135][__main__][INFO] - Starting iteration 567.
[2025-08-20 16:59:29,628][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 16:59:29,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 16:59:29,635][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 16:59:32,075][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 16:59:32,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 16:59:32,083][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 16:59:32,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 16:59:32,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 16:59:32,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:33,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:33,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:34,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:35,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:36,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:37,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:37,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:38,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:39,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:40,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:41,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:41,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:42,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:43,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:44,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 16:59:45,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:45,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:46,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:47,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:48,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:49,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:49,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:50,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:51,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:52,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:53,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:54,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:55,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:55,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:56,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:57,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 16:59:59,107][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:00:00,084][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:00:00,086][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:00:01,494][__main__][INFO] - Iteration 568 took 55s (38.01% Gen, 61.99% Train). Generation: 21s, Training: 34s. Estimated remaining time: 6h 33m 7s. Estimated total time: 15h 22m 38s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 15s, 500 more iterations: 7h 41m 19s. [2025-08-20 17:00:01,495][__main__][INFO] - Starting iteration 568. 
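Each "For task: ..." entry pairs a task name with ΔVRAM Allocated, ΔVRAM Reserved, ΔTime, and "Percentage of VRAM taken", which is the output one would expect from a small profiling wrapper that snapshots the CUDA allocator around a named task. A plausible reconstruction using only public torch.cuda counters follows; the class name and exact format string are assumptions, not the project's actual code:

import time
import torch

class TaskProfiler:
    """Context manager that logs VRAM and wall-clock deltas for a named task."""
    def __init__(self, name, logger, device=0):
        self.name, self.logger, self.device = name, logger, device

    def __enter__(self):
        self.alloc0 = torch.cuda.memory_allocated(self.device)
        self.res0 = torch.cuda.memory_reserved(self.device)
        self.t0 = time.monotonic()
        return self

    def __exit__(self, *exc):
        gib = 1024 ** 3
        d_alloc = (torch.cuda.memory_allocated(self.device) - self.alloc0) / gib
        d_res = (torch.cuda.memory_reserved(self.device) - self.res0) / gib
        total = torch.cuda.get_device_properties(self.device).total_memory
        pct = 100 * torch.cuda.memory_allocated(self.device) / total
        dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - self.t0))
        self.logger.info(
            "For task: %s , ΔVRAM Allocated: %s GB, ΔVRAM Reserved: %s GB, "
            "ΔTime: %s, Percentage of VRAM taken: %s%%,",
            self.name, d_alloc, d_res, dt, pct)

The logged numbers support reading the percentage as allocated bytes over total device memory: the ΔVRAM Allocated of 2.288818359375e-05 GB after each reinforce step is exactly 24 KiB, and the matching jump of about 5.1e-5 percentage points is consistent with a device of roughly 48 GB.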
[2025-08-20 17:00:25,015][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:00:25,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:00:25,023][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:00:27,465][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:00:28,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:00:28,802][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:00:28,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:00:28,808][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:00:29,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:29,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:30,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:31,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:32,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:33,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:33,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:34,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:35,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:36,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:37,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:37,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:38,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:39,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:40,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:41,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:00:41,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:42,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:43,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:44,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:45,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:46,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:47,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:47,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:48,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:49,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:50,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:50,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:51,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:52,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:53,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:54,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:00:55,761][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:00:56,706][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:00:56,708][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:00:58,136][__main__][INFO] - Iteration 569 took 56s (37.22% Gen, 62.78% Train). Generation: 21s, Training: 35s. Estimated remaining time: 6h 53m 33s. Estimated total time: 15h 44m 0s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 24s, 500 more iterations: 7h 52m 0s. [2025-08-20 17:00:58,137][__main__][INFO] - Starting iteration 569. 
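After every reinforce step the trainer checkpoints both optimizer states into the run directory, which is what makes the run resumable at iteration granularity; per the timestamps, the two saves take about a second combined. In plain PyTorch this is one torch.save per optimizer. A sketch with the file names taken from the log and the helper name assumed:

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
    # Adapter/model weights are assumed to be checkpointed elsewhere in the run;
    # only the optimizer moments and step counters are written here.
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)

    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)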
[2025-08-20 17:01:21,305][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:01:21,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:01:21,313][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:01:23,753][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:01:23,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:01:23,761][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:01:23,763][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:01:23,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:01:24,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:24,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:25,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:26,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:27,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:28,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:28,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:29,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:30,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:31,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:32,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:32,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:33,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:34,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:35,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:35,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:01:36,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:37,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:38,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:39,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:39,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:40,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:42,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:42,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:43,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:44,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:45,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:46,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:46,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:47,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:48,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:49,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:01:50,814][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:01:51,811][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:01:51,812][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:01:53,139][__main__][INFO] - Iteration 570 took 55s (37.71% Gen, 62.29% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 25m 19s. Estimated total time: 15h 16m 41s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 20s. [2025-08-20 17:01:53,141][__main__][INFO] - Starting iteration 570. 
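The closing summary of each iteration (duration, Gen/Train split, remaining time, and the 10/100/500-iteration horizons) follows from simple arithmetic on the measured generation and training times. A sketch of that bookkeeping, assuming the projections scale a recent per-iteration average; the function name and the averaging choice are assumptions:

def eta_summary(done_iters, total_iters, gen_s, train_s):
    """Reproduce the per-iteration timing summary from measured wall-clock times."""
    iter_s = gen_s + train_s
    gen_pct = 100 * gen_s / iter_s                      # e.g. 20/55 -> ~37% Gen
    remaining_s = (total_iters - done_iters) * iter_s   # "Estimated remaining time"
    horizons = {n: n * iter_s for n in (10, 100, 500)}  # "Time estimates for N more iterations"
    return iter_s, gen_pct, remaining_s, horizons

The logged figures are consistent with this: ten 55 s iterations give 550 s, matching iteration 570's reported "9m 10s", and an estimated total of about 15h 15m at roughly 55 s per iteration implies a target of about 1,000 iterations.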
[2025-08-20 17:02:16,259][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:02:16,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:02:16,266][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:02:18,723][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:02:18,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:02:18,731][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:02:18,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:02:18,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:02:19,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:19,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:20,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:21,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:22,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:22,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:23,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:24,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:25,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:26,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:26,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:27,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:28,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:29,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:30,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:30,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:02:31,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:32,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:33,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:34,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:34,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:36,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:36,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:37,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:38,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:39,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:40,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:40,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:41,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:42,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:43,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:44,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:02:45,664][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:02:46,630][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:02:46,632][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:02:48,211][__main__][INFO] - Iteration 571 took 55s (37.51% Gen, 62.48% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 25m 32s. Estimated total time: 15h 17m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 54s. [2025-08-20 17:02:48,212][__main__][INFO] - Starting iteration 571. 
[2025-08-20 17:03:11,806][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:03:11,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:03:11,813][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:03:14,254][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:03:14,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:03:14,262][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:03:14,264][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:03:14,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:03:14,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:15,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:16,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:16,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:17,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:18,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:19,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:20,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:20,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:21,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:22,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:23,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:24,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:24,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:25,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:26,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:03:27,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:28,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:29,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:30,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:30,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:31,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:32,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:33,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:34,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:34,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:35,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:36,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:37,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:38,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:38,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:39,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:03:41,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:03:42,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:03:42,187][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:03:43,555][__main__][INFO] - Iteration 572 took 55s (38.19% Gen, 61.81% Train). Generation: 21s, Training: 34s. Estimated remaining time: 6h 29m 9s. Estimated total time: 15h 22m 21s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 10s. [2025-08-20 17:03:43,556][__main__][INFO] - Starting iteration 572. 
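Within each iteration the trainer_ad_align block appears twice (alternative trajectory batch, critic-side advantage accumulation, alternative advantage estimates) before the paired "Sharing advantage alignment data" / "Receiving advantage packets" messages. In advantage-alignment-style opponent shaping, each learner's policy update uses the co-player's advantage estimates alongside its own, so the two trainers swap them before the policy-gradient accumulation begins. A purely illustrative sketch of such an exchange over queues; the packet contents and transport here are assumptions, not the project's actual protocol:

import torch
import torch.multiprocessing as mp

def exchange_advantages(my_adv: torch.Tensor, outbox: mp.Queue, inbox: mp.Queue) -> torch.Tensor:
    """Send this agent's advantage estimates to the co-player and wait for theirs."""
    outbox.put(my_adv.detach().cpu())  # "Sharing advantage alignment data."
    their_adv = inbox.get()            # "Receiving advantage packets."
    return their_adv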
[2025-08-20 17:04:06,745][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:04:06,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:04:06,753][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:04:09,245][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:04:09,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:04:09,253][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:04:09,255][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:04:09,256][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:04:09,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:10,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:11,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:11,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:12,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:13,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:14,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:15,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:15,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:16,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:17,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:18,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:19,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:19,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:20,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:21,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:04:22,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:23,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:23,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:24,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:25,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:26,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:27,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:28,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:29,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:29,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:30,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:31,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:32,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:33,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:33,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:34,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:04:36,311][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:04:37,286][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:04:37,288][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:04:38,756][__main__][INFO] - Iteration 573 took 55s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 25m 50s. Estimated total time: 15h 19m 58s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 59s, 500 more iterations: 7h 39m 59s. [2025-08-20 17:04:38,757][__main__][INFO] - Starting iteration 573. 
[2025-08-20 17:05:02,087][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:05:02,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:05:02,094][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:05:04,540][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:05:04,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:05:04,547][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:05:04,550][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:05:04,550][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:05:04,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:05,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:06,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:07,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:08,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:08,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:09,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:10,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:11,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:11,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:12,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:13,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:14,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:15,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:15,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:16,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:05:17,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:18,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:19,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:19,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:21,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:22,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:22,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:23,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:24,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:25,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:25,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:26,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:27,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:28,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:29,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:29,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:05:31,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:05:32,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:05:32,910][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:05:34,862][__main__][INFO] - Iteration 574 took 56s (37.25% Gen, 62.75% Train). Generation: 20s, Training: 35s. Estimated remaining time: 6h 40m 0s. Estimated total time: 15h 35m 4s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 30s, 500 more iterations: 7h 47m 32s. [2025-08-20 17:05:34,864][__main__][INFO] - Starting iteration 574. 
[2025-08-20 17:05:57,872][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:05:57,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:05:57,880][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:00,327][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:00,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:00,335][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:00,337][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:06:00,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:06:00,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:01,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:02,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:03,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:03,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:04,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:05,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:06,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:06,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:07,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:08,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:09,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:10,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:10,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:11,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:12,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:06:13,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:14,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:15,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:16,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:17,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:18,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:18,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:19,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:20,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:21,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:22,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:22,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:23,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:24,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:25,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:26,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:27,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:06:28,568][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:06:28,570][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:06:30,056][__main__][INFO] - Iteration 575 took 55s (37.27% Gen, 62.73% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 23m 52s. Estimated total time: 15h 19m 51s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 59s, 500 more iterations: 7h 39m 55s. [2025-08-20 17:06:30,058][__main__][INFO] - Starting iteration 575. 
[2025-08-20 17:06:53,103][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:53,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:53,111][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:55,550][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:55,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:55,557][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:06:55,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:06:55,560][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:06:55,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:56,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:57,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:58,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:59,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:06:59,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:00,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:01,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:02,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:02,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:03,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:04,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:05,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:06,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:06,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:07,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:07:08,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:09,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:10,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:10,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:11,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:12,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:13,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:14,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:15,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:16,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:16,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:17,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:18,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:19,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:20,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:20,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:07:22,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:07:23,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:07:23,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:07:24,936][__main__][INFO] - Iteration 576 took 54s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 17m 43s. Estimated total time: 15h 14m 37s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 18s. [2025-08-20 17:07:24,937][__main__][INFO] - Starting iteration 576. 
[2025-08-20 17:07:48,052][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:07:48,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:07:48,061][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:07:50,510][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:07:50,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:07:50,517][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:07:50,519][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:07:50,520][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:07:50,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated for 32 micro-batches in total, last at 17:08:15,838)
[2025-08-20 17:08:17,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:08:18,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:08:18,408][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:08:19,952][__main__][INFO] - Iteration 577 took 55s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 19m 6s. Estimated total time: 15h 16m 55s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 27s.
[2025-08-20 17:08:19,954][__main__][INFO] - Starting iteration 577.
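Editor's note: the "For task: ... , ΔVRAM Allocated ... ΔTime ... Percentage of VRAM taken" lines suggest each stage runs inside a small profiling wrapper. A plausible reconstruction using standard torch.cuda memory counters; the wrapper itself is an assumption, only the log format is taken from the output above:

    import time
    from contextlib import contextmanager

    import torch

    @contextmanager
    def profile_task(name, device=0):
        # Snapshot CUDA memory and wall time on entry, log the deltas on exit.
        alloc0 = torch.cuda.memory_allocated(device)
        res0 = torch.cuda.memory_reserved(device)
        t0 = time.monotonic()
        try:
            yield
        finally:
            d_alloc = (torch.cuda.memory_allocated(device) - alloc0) / 2**30
            d_res = (torch.cuda.memory_reserved(device) - res0) / 2**30
            dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
            total = torch.cuda.get_device_properties(device).total_memory
            pct = 100 * torch.cuda.memory_reserved(device) / total
            print(f"For task: {name} , ΔVRAM Allocated: {d_alloc} GB, "
                  f"ΔVRAM Reserved: {d_res} GB, ΔTime: {dt}, "
                  f"Percentage of VRAM taken: {pct}%,")

Used as `with profile_task("Apply reinforce step"): ...`, this would reproduce the zero deltas seen for the cheap bookkeeping tasks and the small allocation delta on the reinforce step.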
[2025-08-20 17:08:43,123][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:08:43,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:08:43,130][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:08:45,587][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:08:45,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:08:45,594][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:08:45,596][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:08:45,597][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:08:45,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated for 32 micro-batches in total, last at 17:09:11,025)
[2025-08-20 17:09:12,643][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:09:13,690][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:09:13,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:09:14,961][__main__][INFO] - Iteration 578 took 55s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 18m 3s. Estimated total time: 15h 16m 47s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 23s.
[2025-08-20 17:09:14,963][__main__][INFO] - Starting iteration 578.
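Editor's note: both optimizer states are re-saved after every iteration, so a preempted job can resume without losing the Adam moments. The standard PyTorch pattern, with paths taken from the log; the helper names are illustrative:

    import torch

    def save_trainer_state(policy_opt, critic_opt, trainer_dir):
        # Persist optimizer state_dicts every iteration; cheap insurance
        # against preemption on a shared cluster.
        torch.save(policy_opt.state_dict(), f"{trainer_dir}/policy_optimizer_state.pt")
        torch.save(critic_opt.state_dict(), f"{trainer_dir}/critic_optimizer_state.pt")

    def load_trainer_state(policy_opt, critic_opt, trainer_dir):
        # On restart, restore both states before the first update.
        policy_opt.load_state_dict(torch.load(f"{trainer_dir}/policy_optimizer_state.pt"))
        critic_opt.load_state_dict(torch.load(f"{trainer_dir}/critic_optimizer_state.pt"))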
[2025-08-20 17:09:38,002][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:09:38,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:09:38,010][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:09:40,432][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:09:40,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:09:40,439][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:09:40,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:09:40,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:09:40,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated for 32 micro-batches in total, last at 17:10:05,766)
[2025-08-20 17:10:07,332][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:10:08,284][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:10:08,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:10:09,901][__main__][INFO] - Iteration 579 took 54s (37.51% Gen, 62.48% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 15m 58s. Estimated total time: 15h 15m 37s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 48s.
[2025-08-20 17:10:09,903][__main__][INFO] - Starting iteration 579.
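Editor's note: the per-iteration summary is a linear extrapolation from the current iteration's wall time: at roughly 55 s per iteration (about 20 s generation, 34 s training), 10 iterations extrapolate to about 9 m and 500 to about 7 h 38 m, matching the figures in the log. A sketch of that arithmetic, not the actual __main__ code:

    def timing_summary(it, total_iters, gen_s, train_s):
        # Linear ETA: assume every remaining iteration costs what this one did.
        iter_s = gen_s + train_s
        gen_pct = 100 * gen_s / iter_s
        eta = (total_iters - it) * iter_s
        return (f"Iteration {it} took {iter_s:.0f}s ({gen_pct:.2f}% Gen, "
                f"{100 - gen_pct:.2f}% Train). Estimated remaining time: "
                f"{eta // 3600:.0f}h {eta % 3600 // 60:.0f}m {eta % 60:.0f}s.")

The percentages occasionally sum to 99.99% (e.g. 37.51% Gen, 62.48% Train) because each share is rounded independently.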
[2025-08-20 17:10:33,398][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:10:33,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:10:33,406][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:10:35,854][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:10:35,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:10:35,861][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:10:35,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:10:35,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:10:36,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated for 32 micro-batches in total, last at 17:11:01,231)
[2025-08-20 17:11:02,823][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:11:03,782][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:11:03,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:11:05,212][__main__][INFO] - Iteration 580 took 55s (38.03% Gen, 61.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 6h 21m 14s. Estimated total time: 15h 21m 49s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 10s, 500 more iterations: 7h 40m 54s.
[2025-08-20 17:11:05,214][__main__][INFO] - Starting iteration 580.
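Editor's note: the trainer_ad_align triplet (create alternative trajectory batch, critic advantages, alternative advantage estimates) runs twice per iteration, apparently once per agent. The log gives only the task names; a rough, self-contained sketch of the advantage computation under the usual return-minus-baseline definition, with all shapes and names below being assumptions:

    import torch

    def advantage_estimates(values, returns):
        # advantage = empirical return minus the critic's value baseline
        return returns - values

    # One sampled batch and one counterfactual ("alternative") batch,
    # e.g. 32 trajectories of 64 steps (illustrative shapes only):
    values_alt = torch.randn(32, 64)   # critic values on alternative trajectories
    returns_alt = torch.randn(32, 64)  # discounted returns along those trajectories
    adv_alt = advantage_estimates(values_alt, returns_alt)

The zero ΔVRAM and ΔTime readings for these tasks are consistent with them being small tensor operations relative to the model forward/backward passes.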
[2025-08-20 17:11:28,232][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:11:28,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:11:28,239][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:11:30,671][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:11:30,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:11:30,679][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:11:30,681][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:11:30,681][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:11:30,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated for 32 micro-batches in total, last at 17:11:56,131)
[2025-08-20 17:11:57,734][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:11:58,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:11:58,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:12:00,120][__main__][INFO] - Iteration 581 took 54s (37.48% Gen, 62.52% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 13m 36s. Estimated total time: 15h 15m 5s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 32s.
[2025-08-20 17:12:00,122][__main__][INFO] - Starting iteration 581.
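Editor's note: "Sharing advantage alignment data" / "Receiving advantage packets" marks the point where the two agent trainers exchange advantage estimates; in Advantage Alignment each agent's policy gradient is shaped using its opponent's advantages. The actual transport is not visible in the log; a torch.distributed sketch, offered purely as an assumption (it presupposes an initialized process group, one rank per agent):

    import torch
    import torch.distributed as dist

    def exchange_advantage_packets(local_adv, peer_rank):
        # Send our advantage estimates to the opponent trainer and
        # receive theirs in return (symmetric on both ranks).
        peer_adv = torch.empty_like(local_adv)
        req = dist.isend(local_adv, dst=peer_rank)  # non-blocking send
        dist.recv(peer_adv, src=peer_rank)          # blocking receive
        req.wait()
        return peer_adv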
[2025-08-20 17:12:23,271][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:12:23,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:12:23,279][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:12:25,726][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:12:25,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:12:25,734][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:12:25,736][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:12:25,737][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:12:26,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated for 32 micro-batches in total, last at 17:12:51,377)
[2025-08-20 17:12:53,000][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:12:53,949][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:12:53,950][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:12:55,399][__main__][INFO] - Iteration 582 took 55s (37.45% Gen, 62.54% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 18m 52s. Estimated total time: 15h 21m 16s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 7s, 500 more iterations: 7h 40m 38s.
[2025-08-20 17:12:55,400][__main__][INFO] - Starting iteration 582.
[2025-08-20 17:13:18,514][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:13:18,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:13:18,522][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:13:20,966][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:13:20,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:13:20,974][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:13:20,976][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:13:20,976][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:13:21,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated for 32 micro-batches in total, last at 17:13:46,371)
[2025-08-20 17:13:47,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:13:49,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:13:49,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:13:50,874][__main__][INFO] - Iteration 583 took 55s (37.26% Gen, 62.74% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 21m 13s. Estimated total time: 15h 24m 33s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 16s.
[2025-08-20 17:13:50,875][__main__][INFO] - Starting iteration 583.
[2025-08-20 17:14:14,031][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:14:14,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:14:14,039][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:14:16,499][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:14:16,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:14:16,507][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:14:16,509][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:14:16,509][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:14:16,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated for 32 micro-batches in total, last at 17:14:41,924)
[2025-08-20 17:14:43,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:14:44,767][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:14:44,770][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:14:46,142][__main__][INFO] - Iteration 584 took 55s (37.45% Gen, 62.55% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 16m 51s. Estimated total time: 15h 21m 6s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 33s.
[2025-08-20 17:14:46,144][__main__][INFO] - Starting iteration 584.
[2025-08-20 17:15:09,287][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:15:09,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:15:09,295][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:15:11,744][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:15:11,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:15:11,751][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:15:11,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:15:11,754][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:15:12,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated for 32 micro-batches in total, last at 17:15:37,093)
[2025-08-20 17:15:38,684][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:15:39,668][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:15:39,670][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:15:41,363][__main__][INFO] - Iteration 585 took 55s (37.48% Gen, 62.51% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 15m 8s. Estimated total time: 15h 20m 18s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 1s, 500 more iterations: 7h 40m 9s.
[2025-08-20 17:15:41,365][__main__][INFO] - Starting iteration 585.
[2025-08-20 17:16:04,794][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:16:04,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:16:04,802][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:16:07,261][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:16:07,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:16:07,269][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:16:07,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:16:07,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:16:07,572 … 17:16:32,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, one record per gradient-accumulation chunk, ~0.8 s apart)
[2025-08-20 17:16:34,239][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:16:35,175][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:16:35,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:16:36,782][__main__][INFO] - Iteration 586 took 55s (37.86% Gen, 62.14% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 17m 31s. Estimated total time: 15h 23m 37s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 48s.
[2025-08-20 17:16:36,784][__main__][INFO] - Starting iteration 586.
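Annotation: each iteration logs "Sharing advantage alignment data." followed by "Receiving advantage packets.", i.e. the two co-trained adapters exchange advantage estimates before the policy update. The sketch below shows one common form of the advantage-alignment surrogate, where an agent's own advantages are reweighted by an aggregate of the opponent's past advantages so that mutually beneficial actions get a larger policy-gradient weight; the packet layout, the function API, and the exact weighting are assumptions, not the project's protocol.

```python
from dataclasses import dataclass

import torch

@dataclass
class AdvantagePacket:
    """Hypothetical payload exchanged between the two agent trainers."""
    agent_id: str
    advantages: torch.Tensor  # one advantage estimate per turn/step

def ad_align_weights(own_adv: torch.Tensor, opp_adv: torch.Tensor,
                     beta: float = 1.0) -> torch.Tensor:
    """Combine own and opponent advantages into per-step REINFORCE weights.

    Uses the strictly-past cumulative sum of the opponent's advantages, so the
    weight at step t is own_adv[t] * (1 + beta * sum_{k<t} opp_adv[k]).
    """
    opp_past = torch.cumsum(opp_adv, dim=-1) - opp_adv  # exclude current step
    return own_adv * (1.0 + beta * opp_past)
```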
[2025-08-20 17:16:59,879][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:16:59,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:16:59,886][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:02,349][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:02,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:02,357][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:02,359][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:17:02,360][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:17:02,658 … 17:17:27,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, one record per gradient-accumulation chunk, ~0.8 s apart)
[2025-08-20 17:17:29,421][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:17:30,674][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:17:30,676][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:17:31,947][__main__][INFO] - Iteration 587 took 55s (37.43% Gen, 62.57% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 12m 21s. Estimated total time: 15h 19m 22s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 56s, 500 more iterations: 7h 39m 41s.
[2025-08-20 17:17:31,948][__main__][INFO] - Starting iteration 587.
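Annotation: every iteration then accumulates the REINFORCE loss over 32 fixed-size chunks of 8,960 tokens each (32 × 8,960 = 286,720 tokens per update) before the single "Apply reinforce step". A minimal sketch of that accumulation loop follows, assuming a Hugging Face-style causal LM and precomputed per-token advantages; the function name, chunking, and normalization are illustrative, not the trainer's actual code.

```python
import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def reinforce_accumulate(model, chunks, optimizer):
    """Accumulate -(advantage * log-prob) gradients over token chunks,
    then apply one optimizer step (the log's "Apply reinforce step")."""
    optimizer.zero_grad(set_to_none=True)
    n_chunks = len(chunks)  # 32 in this run
    for input_ids, advantages, mask in chunks:
        out = model(input_ids)                      # HF-style causal LM assumed
        logp = torch.log_softmax(out.logits[:, :-1], dim=-1)
        # Log-prob of each realized next token.
        tok_logp = logp.gather(-1, input_ids[:, 1:, None]).squeeze(-1)
        loss = -(advantages[:, 1:] * tok_logp * mask[:, 1:]).sum()
        # backward() adds into .grad, so gradients accumulate across chunks.
        (loss / (mask[:, 1:].sum() * n_chunks)).backward()
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    int(mask.numel()))  # 8960 tokens per chunk in this run
    optimizer.step()
```

The tiny ΔVRAM on "Apply reinforce step" (≈2.3e-05 GB) is consistent with this pattern: activations are freed after each chunk's backward pass, and the optimizer state already exists after the first iteration.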
[2025-08-20 17:17:54,985][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:54,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:54,993][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:57,436][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:57,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:57,444][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:17:57,446][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:17:57,447][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:17:57,745 … 17:18:22,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, one record per gradient-accumulation chunk, ~0.8 s apart)
[2025-08-20 17:18:24,425][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:18:25,408][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:18:25,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:18:26,770][__main__][INFO] - Iteration 588 took 54s (37.57% Gen, 62.43% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 5m 45s. Estimated total time: 15h 13m 41s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 22s, 500 more iterations: 7h 36m 50s.
[2025-08-20 17:18:26,771][__main__][INFO] - Starting iteration 588.
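Annotation: the two "Saved … optimizer state" lines write to the same two .pt paths every iteration, so a restart can resume from the most recent step. Plain torch.save of the optimizers' state_dicts reproduces this behavior; the helper name below is hypothetical, while the file names are taken directly from the log.

```python
import os

import torch

def save_optimizer_states(policy_opt: torch.optim.Optimizer,
                          critic_opt: torch.optim.Optimizer,
                          out_dir: str) -> None:
    """Overwrite the per-trainer optimizer-state checkpoints each iteration."""
    os.makedirs(out_dir, exist_ok=True)
    torch.save(policy_opt.state_dict(),
               os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(),
               os.path.join(out_dir, "critic_optimizer_state.pt"))

# Resuming would mirror this with load_state_dict:
# policy_opt.load_state_dict(torch.load(path, map_location="cpu"))
```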
[2025-08-20 17:18:49,813][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:18:49,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:18:49,820][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:18:52,277][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:18:52,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:18:52,285][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:18:52,287][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:18:52,287][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:18:52,588 … 17:19:17,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, one record per gradient-accumulation chunk, ~0.8 s apart)
[2025-08-20 17:19:19,323][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:19:20,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:19:20,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:19:21,939][__main__][INFO] - Iteration 589 took 55s (37.31% Gen, 62.69% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 10m 36s. Estimated total time: 15h 19m 27s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 56s, 500 more iterations: 7h 39m 43s.
[2025-08-20 17:19:21,941][__main__][INFO] - Starting iteration 589.
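Annotation: the closing summary line of each iteration extrapolates the current iteration's duration linearly over the remaining count, e.g. at ≈55 s/iteration, 10 more iterations ≈ 9 m 10 s and 500 more ≈ 7 h 40 m, matching the figures logged above. A sketch of such an estimator follows; the real trainer's averaging window and its "9m 12s"-style time formatting are assumptions and may differ.

```python
import datetime

def iteration_summary(it: int, gen_s: float, train_s: float,
                      total_iters: int) -> str:
    """Format one per-iteration timing line with linear ETA extrapolation."""
    took = gen_s + train_s

    def _fmt(seconds: float) -> str:
        # datetime.timedelta prints H:MM:SS; the actual log uses "Xh Ym Zs".
        return str(datetime.timedelta(seconds=round(seconds)))

    return (
        f"Iteration {it} took {took:.0f}s "
        f"({100 * gen_s / took:.2f}% Gen, {100 * train_s / took:.2f}% Train). "
        f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s. "
        f"Estimated remaining time: {_fmt((total_iters - it) * took)}. "
        f"Time estimates for 10 more iterations: {_fmt(10 * took)}, "
        f"100 more iterations: {_fmt(100 * took)}, "
        f"500 more iterations: {_fmt(500 * took)}."
    )
```

Because the estimate is driven by the latest iteration, a single slow step (e.g. the 59 s iteration below, where training jumped to 39 s) visibly inflates the projected total before it settles back.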
[2025-08-20 17:19:44,890][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:19:44,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:19:44,898][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:19:47,325][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:19:47,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:19:47,332][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:19:47,335][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:19:47,335][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:19:47,633 … 17:20:12,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, one record per gradient-accumulation chunk, ~0.8 s apart)
[2025-08-20 17:20:14,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:20:15,341][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:20:15,342][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:20:16,755][__main__][INFO] - Iteration 590 took 54s (37.44% Gen, 62.56% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 3m 48s. Estimated total time: 15h 13m 33s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 21s, 500 more iterations: 7h 36m 46s.
[2025-08-20 17:20:16,757][__main__][INFO] - Starting iteration 590.
[2025-08-20 17:20:40,214][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:20:40,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:20:40,221][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:20:42,671][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:20:42,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:20:42,679][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:20:42,681][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:20:42,682][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:20:42,980 … 17:21:08,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, one record per gradient-accumulation chunk, ~0.8 s apart)
[2025-08-20 17:21:09,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:21:10,594][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:21:10,595][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:21:12,840][__main__][INFO] - Iteration 591 took 56s (37.46% Gen, 62.54% Train). Generation: 21s, Training: 35s. Estimated remaining time: 6h 24m 1s. Estimated total time: 15h 34m 43s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 28s, 500 more iterations: 7h 47m 21s.
[2025-08-20 17:21:12,842][__main__][INFO] - Starting iteration 591.
[2025-08-20 17:21:35,862][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:21:35,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:21:35,870][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:21:38,336][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:21:38,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:21:38,344][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:21:38,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:21:38,347][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:21:38,647 … 17:22:03,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, one record per gradient-accumulation chunk, ~0.8 s apart)
[2025-08-20 17:22:05,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:22:06,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:22:06,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:22:07,903][__main__][INFO] - Iteration 592 took 55s (37.33% Gen, 62.67% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 6m 4s. Estimated total time: 15h 17m 41s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 50s.
[2025-08-20 17:22:07,905][__main__][INFO] - Starting iteration 592.
[2025-08-20 17:22:30,800][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:22:30,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:22:30,807][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:22:33,245][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:22:33,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:22:33,252][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:22:33,254][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:22:33,255][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:22:33,553 … 17:22:58,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, one record per gradient-accumulation chunk, ~0.8 s apart)
[2025-08-20 17:23:00,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:23:01,239][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:23:01,240][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:23:07,786][__main__][INFO] - Iteration 593 took 59s (34.16% Gen, 65.84% Train). Generation: 20s, Training: 39s. Estimated remaining time: 7h 25m 20s. Estimated total time: 16h 37m 56s. Time estimates for 10 more iterations: 9m 58s, 100 more iterations: 1h 39m 47s, 500 more iterations: 8h 18m 58s.
[2025-08-20 17:23:07,787][__main__][INFO] - Starting iteration 593.
[2025-08-20 17:23:30,801][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:23:30,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:23:30,809][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:23:33,250][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:23:33,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:23:33,257][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:23:33,260][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:23:33,260][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:23:33,559 … 17:23:58,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32, one record per gradient-accumulation chunk, ~0.8 s apart)
[2025-08-20 17:24:00,226][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:24:01,257][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:24:01,259][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:24:02,643][__main__][INFO] - Iteration 594 took 54s (37.50% Gen, 62.50% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 0m 43s. Estimated total time: 15h 14m 15s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 25s, 500 more iterations: 7h 37m 7s.
[2025-08-20 17:24:02,645][__main__][INFO] - Starting iteration 594.
[2025-08-20 17:24:25,686][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:24:25,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:24:25,694][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:24:28,168][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:24:28,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:24:28,176][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:24:28,178][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:24:28,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:24:28,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted ...]
[2025-08-20 17:24:53,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:24:55,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:24:56,228][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:24:56,229][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:24:57,518][__main__][INFO] - Iteration 595 took 54s (37.53% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 0m 5s. Estimated total time: 15h 14m 32s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 16s.
[2025-08-20 17:24:57,519][__main__][INFO] - Starting iteration 595.
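
"Sharing advantage alignment data." followed immediately by "Receiving advantage packets." marks the point where the two players' trainers swap advantage estimates: advantage alignment uses the co-player's advantage estimates alongside an agent's own, so neither trainer can accumulate its policy-gradient loss until the exchange completes. The transport is not visible in the log. A minimal sketch of such an exchange, assuming the trainers run in separate processes linked by a pair of queues (AdvantagePacket and the queue wiring are hypothetical):

from dataclasses import dataclass

import torch
import torch.multiprocessing as mp

@dataclass
class AdvantagePacket:
    iteration: int
    advantages: torch.Tensor  # per-step advantage estimates for the batch

def exchange_advantages(own: AdvantagePacket,
                        send_q: mp.Queue,
                        recv_q: mp.Queue) -> AdvantagePacket:
    send_q.put(own)        # "Sharing advantage alignment data."
    other = recv_q.get()   # "Receiving advantage packets."
    assert other.iteration == own.iteration, "trainers out of sync"
    return other
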
[2025-08-20 17:25:20,801][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:25:20,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:25:20,808][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:25:23,257][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:25:23,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:25:23,264][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:25:23,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:25:23,267][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:25:23,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted ...]
[2025-08-20 17:25:48,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:25:50,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:25:51,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:25:51,468][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:25:52,891][__main__][INFO] - Iteration 596 took 55s (37.64% Gen, 62.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 7m 29s. Estimated total time: 15h 22m 51s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 17s, 500 more iterations: 7h 41m 25s.
[2025-08-20 17:25:52,893][__main__][INFO] - Starting iteration 596.
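
Between the packet exchange and "Apply reinforce step", every iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." entries: the update is computed over 32 micro-batches of 8960 tokens each, with gradients accumulated across them and a single optimizer step at the end. A minimal sketch of that pattern, assuming an advantage-weighted REINFORCE surrogate (the batch fields and the policy.log_probs helper are assumptions):

import logging

logger = logging.getLogger("mllm.training.trainer_common")

def reinforce_step(policy, optimizer, micro_batches):
    optimizer.zero_grad(set_to_none=True)
    for batch in micro_batches:  # 32 micro-batches in this run
        logprobs = policy.log_probs(batch.tokens)  # per-token log-probs
        # Advantage-weighted surrogate; dividing by the number of
        # micro-batches makes the accumulated gradient match one
        # full-batch update.
        loss = -(batch.advantages * logprobs).mean() / len(micro_batches)
        loss.backward()  # accumulate into .grad, no optimizer step yet
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    batch.tokens.numel())  # 8960 here
    optimizer.step()  # logged above as "Apply reinforce step"

The tiny ΔVRAM on the reinforce step (about 2.3e-05 GB) is consistent with gradient buffers and optimizer state having been allocated back at iteration 0, so later steps reuse memory rather than growing it.
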
[2025-08-20 17:26:15,892][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:26:15,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:26:15,900][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:26:18,349][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:26:18,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:26:18,357][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:26:18,359][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:26:18,360][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:26:18,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted ...]
[2025-08-20 17:26:43,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:26:45,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:26:46,379][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:26:46,380][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:26:47,844][__main__][INFO] - Iteration 597 took 54s (37.41% Gen, 62.59% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 59m 34s. Estimated total time: 15h 15m 50s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 55s.
[2025-08-20 17:26:47,846][__main__][INFO] - Starting iteration 597.
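
Each iteration ends by overwriting the same two optimizer-state files under seed_1000/agent_trainer/, so a preempted job can resume with optimizer momentum intact from the last completed iteration. A sketch of that save step, assuming a plain torch.save of the optimizers' state dicts (the function name is hypothetical):

import logging
from pathlib import Path

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir):
    out = Path(out_dir)
    policy_path = out / "policy_optimizer_state.pt"
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    critic_path = out / "critic_optimizer_state.pt"
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)
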
[2025-08-20 17:27:10,780][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:27:10,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:27:10,788][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:27:13,276][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:27:13,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:27:13,283][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:27:13,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:27:13,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:27:13,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted ...]
[2025-08-20 17:27:38,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:27:40,381][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:27:41,367][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:27:41,368][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:27:42,713][__main__][INFO] - Iteration 598 took 54s (37.29% Gen, 62.71% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 57m 15s. Estimated total time: 15h 14m 27s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 26s, 500 more iterations: 7h 37m 13s.
[2025-08-20 17:27:42,715][__main__][INFO] - Starting iteration 598.
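
The closing summary line splits each iteration's wall time into generation and training and projects the remaining time forward. The projected total drifts between roughly 15h 11m and 15h 42m within a few iterations, which suggests the estimate tracks recent iteration times rather than a global mean. A sketch of such a tracker; the window size and the exact averaging rule are assumptions:

from collections import deque

def fmt_hms(seconds):
    # Render a duration as "Xh Ym Zs", dropping hours when zero.
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

class EtaTracker:
    def __init__(self, total_iterations, window=20):
        self.total_iterations = total_iterations
        self.durations = deque(maxlen=window)  # recent iteration wall times

    def summary(self, iteration, gen_s, train_s, total_s):
        self.durations.append(total_s)
        avg = sum(self.durations) / len(self.durations)
        return (
            f"Iteration {iteration} took {total_s:.0f}s "
            f"({100 * gen_s / total_s:.2f}% Gen, {100 * train_s / total_s:.2f}% Train). "
            f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s. "
            f"Estimated remaining time: {fmt_hms(avg * (self.total_iterations - iteration))}. "
            f"Estimated total time: {fmt_hms(avg * self.total_iterations)}. "
            f"Time estimates for 10 more iterations: {fmt_hms(10 * avg)}, "
            f"100 more iterations: {fmt_hms(100 * avg)}, "
            f"500 more iterations: {fmt_hms(500 * avg)}."
        )

Note that the Gen and Train percentages are taken against the full iteration time, so they need not sum to 100% when there is per-iteration overhead (as in iteration 601 below).
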
[2025-08-20 17:28:05,639][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:28:05,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:28:05,646][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:28:08,084][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:28:08,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:28:08,091][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:28:08,093][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:28:08,094][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:28:08,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted ...]
[2025-08-20 17:28:35,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:28:36,817][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:28:37,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:28:37,790][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:28:39,253][__main__][INFO] - Iteration 599 took 56s (36.23% Gen, 63.77% Train). Generation: 20s, Training: 36s. Estimated remaining time: 6h 24m 9s. Estimated total time: 15h 42m 17s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 13s, 500 more iterations: 7h 51m 8s.
[2025-08-20 17:28:39,254][__main__][INFO] - Starting iteration 599.
[2025-08-20 17:29:02,398][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:02,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:02,405][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:04,868][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:04,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:04,876][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:04,878][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:29:04,878][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:29:05,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted ...]
[2025-08-20 17:29:30,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:29:31,907][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:29:32,852][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:29:32,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:29:34,209][__main__][INFO] - Iteration 600 took 54s (37.66% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 56m 50s. Estimated total time: 15h 15m 54s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 57s.
[2025-08-20 17:29:34,210][__main__][INFO] - Starting iteration 600.
[2025-08-20 17:29:57,234][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:57,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:57,241][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:59,690][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:59,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:59,697][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:29:59,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:29:59,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:29:59,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted ...]
[2025-08-20 17:30:25,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:30:26,773][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:30:27,696][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:30:27,697][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:30:32,176][__main__][INFO] - Iteration 601 took 57s (35.47% Gen, 59.68% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 46m 4s. Estimated total time: 16h 6m 5s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 36s, 500 more iterations: 8h 3m 2s.
[2025-08-20 17:30:32,177][__main__][INFO] - Starting iteration 601.
[2025-08-20 17:30:55,397][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:30:55,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:30:55,405][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:30:57,856][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:30:57,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:30:57,864][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:30:57,866][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:30:57,866][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:30:58,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted ...]
[2025-08-20 17:31:23,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:31:24,878][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:31:25,838][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:31:25,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:31:29,499][__main__][INFO] - Iteration 602 took 57s (36.24% Gen, 63.75% Train). Generation: 20s, Training: 36s. Estimated remaining time: 6h 34m 22s. Estimated total time: 15h 55m 21s. Time estimates for 10 more iterations: 9m 33s, 100 more iterations: 1h 35m 32s, 500 more iterations: 7h 57m 40s.
[2025-08-20 17:31:29,501][__main__][INFO] - Starting iteration 602.
[2025-08-20 17:31:52,380][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:31:52,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:31:52,388][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:31:54,837][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:31:54,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:31:54,845][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:31:54,847][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:31:54,847][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:31:55,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted ...]
[2025-08-20 17:32:20,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:32:21,879][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:32:22,839][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:32:22,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:32:24,175][__main__][INFO] - Iteration 603 took 54s (37.36% Gen, 62.64% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 49m 20s. Estimated total time: 15h 11m 13s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 7s, 500 more iterations: 7h 35m 36s.
[2025-08-20 17:32:24,176][__main__][INFO] - Starting iteration 603.
[2025-08-20 17:32:47,137][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:32:47,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:32:47,144][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:32:49,569][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:32:49,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:32:49,577][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:32:49,579][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:32:49,579][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:32:49,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:50,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:51,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:52,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:53,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:53,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:54,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:55,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:56,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:57,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:57,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:58,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:32:59,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:00,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:00,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:01,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:33:02,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:03,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:04,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:04,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:05,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:06,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:07,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:08,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:09,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:10,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:11,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:11,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:12,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:13,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:14,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:15,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:16,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:33:17,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:33:17,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:33:18,878][__main__][INFO] - Iteration 604 took 54s (37.53% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 48m 53s. Estimated total time: 15h 11m 41s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 50s. [2025-08-20 17:33:18,879][__main__][INFO] - Starting iteration 604. 
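Each iteration logs 32 "Accumulated the policy gradient loss for 8960 tokens." records at roughly 0.8 s intervals, i.e. 32 micro-batches and 32 × 8,960 = 286,720 trainable tokens per update. A sketch of what such an accumulation loop can look like is below; the model call, the `response_mask` and `advantages` field names, and the per-token weighting are assumptions, not mllm's implementation.

```python
import torch


def accumulate_policy_gradient(model, micro_batches, logger):
    """Accumulate a REINFORCE-style loss over micro-batches before one update."""
    n_mb = len(micro_batches)
    for mb in micro_batches:
        out = model(input_ids=mb["input_ids"], attention_mask=mb["attention_mask"])
        logprobs = torch.log_softmax(out.logits[:, :-1], dim=-1)
        token_logp = logprobs.gather(
            -1, mb["input_ids"][:, 1:].unsqueeze(-1)
        ).squeeze(-1)
        mask = mb["response_mask"][:, 1:]        # selects the trainable tokens
        n_tokens = max(int(mask.sum()), 1)
        # -E[A_t * log pi(a_t | s_t)], mean per token; dividing by n_mb makes
        # the accumulated gradient equal to the full-batch mean.
        loss = -(mb["advantages"][:, 1:] * token_logp * mask).sum() / (n_tokens * n_mb)
        loss.backward()                          # gradients accumulate in place
        logger.info("Accumulated the policy gradient loss for %d tokens.", n_tokens)
```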
[2025-08-20 17:33:41,910][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:33:41,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:33:41,918][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:33:44,348][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:33:44,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:33:44,355][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:33:44,358][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:33:44,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:33:44,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:45,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:46,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:47,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:47,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:48,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:49,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:50,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:51,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:51,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:52,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:53,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:54,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:54,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:55,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:56,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:33:57,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:58,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:33:59,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:00,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:01,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:01,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:02,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:03,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:04,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:05,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:05,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:06,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:07,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:08,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:09,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:09,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:11,415][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:34:12,370][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:34:12,372][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:34:13,750][__main__][INFO] - Iteration 605 took 54s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 50m 47s. Estimated total time: 15h 14m 30s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 15s. [2025-08-20 17:34:13,752][__main__][INFO] - Starting iteration 605. 
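After accumulation comes a single "Apply reinforce step" followed by two optimizer-state checkpoints. The ΔVRAM of 2.288818359375e-05 GB is exactly 24 KB, consistent with the optimizer state tensors already existing and being updated in place. A plausible sketch of this step (the real trainer may additionally clip gradients or step LR schedulers):

```python
import torch


def apply_reinforce_step(policy_opt, critic_opt, save_dir, logger):
    """One optimizer update after accumulation, then checkpoint both states."""
    policy_opt.step()
    policy_opt.zero_grad(set_to_none=True)
    critic_opt.step()
    critic_opt.zero_grad(set_to_none=True)

    policy_path = f"{save_dir}/policy_optimizer_state.pt"
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)

    critic_path = f"{save_dir}/critic_optimizer_state.pt"
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)
```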
[2025-08-20 17:34:36,645][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:34:36,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:34:36,653][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:34:39,148][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:34:39,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:34:39,155][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:34:39,158][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:34:39,158][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:34:39,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:40,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:41,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:41,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:42,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:43,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:44,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:45,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:45,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:46,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:47,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:48,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:48,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:49,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:50,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:51,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:34:52,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:53,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:54,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:55,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:55,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:56,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:57,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:58,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:59,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:34:59,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:00,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:01,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:02,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:03,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:03,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:04,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:06,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:35:07,221][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:35:07,222][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:35:08,713][__main__][INFO] - Iteration 606 took 54s (37.12% Gen, 62.88% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 51m 23s. Estimated total time: 15h 16m 0s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 0s. [2025-08-20 17:35:08,714][__main__][INFO] - Starting iteration 606. 
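The per-iteration summary lines are simple arithmetic over a running mean of iteration durations: at ~54.7 s per iteration, "100 more iterations: 1h 31m" is just 100 × 54.7 s, and the estimated total of ~15h 11m divided by ~54.7 s puts the schedule at roughly 1,000 iterations. A sketch of the formatting, with `mean_iter_s` and `total_iters` as assumed inputs:

```python
def fmt_hms(seconds: float) -> str:
    """Format seconds in the log's '5h 49m 20s' / '54s' style."""
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if h or m:
        parts.append(f"{m}m")
    parts.append(f"{s}s")
    return " ".join(parts)


def iteration_summary(it, gen_s, train_s, mean_iter_s, total_iters):
    """Reconstruct the gist of the per-iteration summary line."""
    took = gen_s + train_s
    gen_pct = 100.0 * gen_s / took
    remaining = mean_iter_s * (total_iters - it)
    extras = ", ".join(
        f"{n} more iterations: {fmt_hms(mean_iter_s * n)}" for n in (10, 100, 500)
    )
    return (
        f"Iteration {it} took {fmt_hms(took)} ({gen_pct:.2f}% Gen, "
        f"{100 - gen_pct:.2f}% Train). Estimated remaining time: "
        f"{fmt_hms(remaining)}. Time estimates for {extras}."
    )
```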
[2025-08-20 17:35:31,955][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:35:31,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:35:31,963][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:35:34,462][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:35:34,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:35:34,470][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:35:34,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:35:34,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:35:34,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:35,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:36,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:37,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:37,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:38,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:39,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:40,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:41,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:41,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:42,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:43,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:44,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:45,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:45,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:46,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:35:47,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:48,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:49,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:49,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:51,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:51,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:52,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:53,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:54,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:55,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:55,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:56,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:57,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:58,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:59,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:35:59,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:01,507][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:36:02,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:36:02,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:36:04,407][__main__][INFO] - Iteration 607 took 55s (37.29% Gen, 62.71% Train). Generation: 20s, Training: 34s. Estimated remaining time: 6h 2m 38s. Estimated total time: 15h 28m 11s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 5s. [2025-08-20 17:36:04,408][__main__][INFO] - Starting iteration 607. 
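"Sharing advantage alignment data" followed by "Receiving advantage packets" indicates the two trainers swap advantage estimates so that each can use its opponent's advantages in its own objective. The transport is not visible in the log; below is a minimal in-process sketch with a hypothetical `AdvantageExchange` class (the real run may use files, sockets, or torch.distributed instead).

```python
import queue


class AdvantageExchange:
    """Symmetric advantage swap between two trainers (hypothetical transport)."""

    def __init__(self):
        self._queues = {0: queue.Queue(), 1: queue.Queue()}

    def share(self, sender: int, advantages):
        # Deliver this trainer's per-timestep advantages to the other trainer.
        self._queues[1 - sender].put(advantages)

    def receive(self, receiver: int, timeout: float = 60.0):
        # Block until the opponent's advantage packet arrives.
        return self._queues[receiver].get(timeout=timeout)
```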
[2025-08-20 17:36:27,329][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:36:27,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:36:27,337][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:36:29,785][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:36:29,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:36:29,793][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:36:29,795][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:36:29,796][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:36:30,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:30,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:31,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:32,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:33,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:34,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:34,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:35,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:36,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:37,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:38,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:38,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:39,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:40,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:41,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:42,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:36:43,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:44,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:44,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:45,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:46,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:47,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:48,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:48,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:49,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:50,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:51,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:51,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:52,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:53,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:54,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:55,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:36:56,738][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:36:57,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:36:57,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:36:59,438][__main__][INFO] - Iteration 608 took 55s (37.21% Gen, 62.78% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 50m 41s. Estimated total time: 15h 17m 9s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 42s, 500 more iterations: 7h 38m 34s. [2025-08-20 17:36:59,440][__main__][INFO] - Starting iteration 608. 
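The paired "Create alternative trajectory batch" / "Compute alternative advantage estimates" tasks suggest counterfactual advantages are estimated alongside the sampled ones, which is what advantage-alignment objectives need: published formulations weight each action's log-probability by a coupling of the discounted sum of the agent's past advantages with the opponent's current advantage. The sketch below implements that coupling under assumed [batch, time] tensor shapes and hypothetical `gamma`/`beta` hyper-parameters; it should not be read as this repository's exact loss.

```python
import torch


def advantage_alignment_weights(own_adv, opp_adv, gamma=0.96, beta=1.0):
    """Per-token weights for an advantage-alignment surrogate (a sketch).

    own_adv, opp_adv: [B, T] advantage estimates for the two players.
    Combines the standard REINFORCE weight (own_adv) with an alignment term
    coupling discounted past own advantages to the opponent's current one.
    """
    B, T = own_adv.shape
    past = torch.zeros_like(own_adv)
    acc = torch.zeros(B, device=own_adv.device)
    for t in range(T):
        acc = gamma * acc + own_adv[:, t]   # discounted sum of own advantages
        past[:, t] = acc
    return own_adv + beta * past * opp_adv
```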
[2025-08-20 17:37:22,414][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:37:22,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:37:22,421][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:37:24,869][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:37:24,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:37:24,877][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:37:24,879][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:37:24,880][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:37:25,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:25,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:26,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:27,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:28,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:29,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:29,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:30,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:31,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:32,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:33,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:33,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:34,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:35,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:36,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:37,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:37:37,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:38,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:39,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:40,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:41,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:41,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:42,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:43,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:44,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:45,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:46,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:47,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:47,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:48,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:49,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:50,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:37:51,917][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:37:52,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:37:52,866][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:37:54,165][__main__][INFO] - Iteration 609 took 54s (37.49% Gen, 62.51% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 44m 41s. Estimated total time: 15h 12m 4s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 12s, 500 more iterations: 7h 36m 2s. [2025-08-20 17:37:54,167][__main__][INFO] - Starting iteration 609. 
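"Get advantages with critic gradient accumulation" reads as computing advantages and back-propagating the critic's value loss in the same chunked pass. A sketch under stated assumptions (one-step TD advantages, zero terminal bootstrap, a critic returning one value per timestep); none of the names below come from mllm itself:

```python
import torch
import torch.nn.functional as F


def advantages_with_critic_grad(critic, states, rewards, gamma=0.96, n_chunks=4):
    """Compute A_t = r_t + gamma * V(s_{t+1}) - V(s_t) while accumulating
    the critic's value-loss gradients chunk by chunk (a sketch)."""
    adv_chunks = []
    for s, r in zip(states.chunk(n_chunks), rewards.chunk(n_chunks)):
        values = critic(s).squeeze(-1)                         # [b, T]
        next_v = torch.cat(
            [values[:, 1:], torch.zeros_like(values[:, :1])], dim=1
        )
        target = r + gamma * next_v.detach()                   # TD(0) target
        adv_chunks.append((target - values).detach())
        F.mse_loss(values, target).div(n_chunks).backward()    # accumulate grads
    return torch.cat(adv_chunks)
```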
[2025-08-20 17:38:16,973][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:38:16,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:38:16,981][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:38:19,430][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:38:19,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:38:19,438][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:38:19,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:38:19,441][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:38:19,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:20,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:21,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:22,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:22,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:23,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:24,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:25,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:26,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:26,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:27,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:28,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:29,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:30,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:30,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:31,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:38:32,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:33,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:34,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:35,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:36,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:36,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:37,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:38,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:39,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:40,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:40,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:41,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:42,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:43,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:44,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:44,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:38:46,537][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:38:47,493][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:38:47,494][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:38:48,869][__main__][INFO] - Iteration 610 took 54s (37.22% Gen, 62.78% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 43m 23s. Estimated total time: 15h 11m 41s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 50s. [2025-08-20 17:38:48,873][__main__][INFO] - Starting iteration 610. 
[2025-08-20 17:39:12,107][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:39:12,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:39:12,115][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:39:14,572][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:39:14,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:39:14,579][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:39:14,582][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:39:14,582][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:39:14,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:15,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:16,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:17,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:18,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:18,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:19,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:20,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:21,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:22,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:22,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:23,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:24,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:25,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:25,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:26,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:39:27,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:28,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:29,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:29,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:31,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:31,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:32,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:33,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:34,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:35,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:35,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:36,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:37,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:38,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:39,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:39,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:39:41,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:39:42,452][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:39:42,454][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:39:44,049][__main__][INFO] - Iteration 611 took 55s (37.69% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 50m 23s. Estimated total time: 15h 19m 36s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 57s, 500 more iterations: 7h 39m 48s. [2025-08-20 17:39:44,051][__main__][INFO] - Starting iteration 611. 
[2025-08-20 17:40:07,482][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:40:07,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:40:07,489][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:40:09,960][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:40:09,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:40:09,969][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 17:40:09,971][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 17:40:09,971][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 17:40:10,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:11,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:11,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:12,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:13,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:14,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:15,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:15,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:16,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:17,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:18,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:18,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:19,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:20,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:21,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:22,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 17:40:22,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:23,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:24,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:25,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:26,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:26,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:27,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:29,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:29,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:30,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:31,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:32,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:33,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:33,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:34,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:35,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 17:40:36,998][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 17:40:38,169][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 17:40:38,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 17:40:39,468][__main__][INFO] - Iteration 612 took 55s (37.85% Gen, 62.15% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 53m 28s. Estimated total time: 15h 23m 37s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 48s. [2025-08-20 17:40:39,470][__main__][INFO] - Starting iteration 612. 
[2025-08-20 17:41:02,332][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:41:02,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:41:02,339][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:41:04,804 – 17:41:04,812][mllm.training.trainer_ad_align][INFO] - (same three task reports repeated; all deltas zero)
[2025-08-20 17:41:04,814][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:41:04,815][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:41:05,114 – 17:41:30,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32)
[2025-08-20 17:41:31,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:41:32,733][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:41:32,734][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:41:34,032][__main__][INFO] - Iteration 613 took 54s (37.37% Gen, 62.63% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 38m 18s. Estimated total time: 15h 9m 21s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 56s, 500 more iterations: 7h 34m 40s.
[2025-08-20 17:41:34,033][__main__][INFO] - Starting iteration 613.
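Every update accumulates the loss over 32 microbatches of exactly 8960 tokens each, roughly 287k tokens per optimizer step; the constant token count suggests the microbatches are packed or padded to a fixed length. A minimal sketch of that loop, where `accumulate_and_step` and `policy_gradient_loss` are hypothetical helpers standing in for the trainer's actual code:

```python
import torch

def accumulate_and_step(model, optimizer, microbatches, logger):
    """Sketch of the 32-microbatch gradient accumulation seen in the log."""
    optimizer.zero_grad(set_to_none=True)
    for batch in microbatches:
        # Scale by the microbatch count so the summed gradient matches one
        # large batch. policy_gradient_loss is a hypothetical helper.
        loss = policy_gradient_loss(model, batch) / len(microbatches)
        loss.backward()
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    batch["input_ids"].numel())
    optimizer.step()  # presumably the "Apply reinforce step" that follows
```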
[2025-08-20 17:41:57,090][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:41:57,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:41:57,098][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:41:59,567 – 17:41:59,575][mllm.training.trainer_ad_align][INFO] - (same three task reports repeated; all deltas zero)
[2025-08-20 17:41:59,577][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:41:59,578][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:41:59,876 – 17:42:25,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32)
[2025-08-20 17:42:26,625][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:42:27,568][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:42:27,569][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:42:28,855][__main__][INFO] - Iteration 614 took 54s (37.56% Gen, 62.44% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 41m 43s. Estimated total time: 15h 13m 41s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 22s, 500 more iterations: 7h 36m 50s.
[2025-08-20 17:42:28,857][__main__][INFO] - Starting iteration 614.
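The "For task: …" lines read like output from a profiling wrapper: deltas of allocated and reserved CUDA memory, elapsed time, and the reserved share of total VRAM. By this point in the run the memory footprint is steady state: the reinforce step allocates only 2.288818359375e-05 GB (24 KB), and the jump from 69.91574…% to 69.91579…% is consistent with those 24 KB on a roughly 48 GB device, so the pool reserved early in training is simply being reused. A context-manager sketch using the standard torch.cuda counters (the wrapper name `track_task` is hypothetical):

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def track_task(name, logger, device=0):
    """Logs per-task VRAM and time deltas in the format seen above."""
    torch.cuda.synchronize(device)
    alloc0 = torch.cuda.memory_allocated(device)
    res0 = torch.cuda.memory_reserved(device)
    t0 = time.monotonic()
    yield
    torch.cuda.synchronize(device)
    d_alloc = (torch.cuda.memory_allocated(device) - alloc0) / 2**30  # GB
    d_res = (torch.cuda.memory_reserved(device) - res0) / 2**30      # GB
    pct = 100.0 * torch.cuda.memory_reserved(device) \
        / torch.cuda.get_device_properties(device).total_memory
    dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
    logger.info("For task: %s , ΔVRAM Allocated: %s GB, ΔVRAM Reserved: %s GB, "
                "ΔTime: %s, Percentage of VRAM taken: %s%%,",
                name, d_alloc, d_res, dt, pct)

# Usage: with track_task("Apply reinforce step", logger): optimizer.step()
```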
[2025-08-20 17:42:51,798][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:42:51,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:42:51,806][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:42:54,278 – 17:42:54,285][mllm.training.trainer_ad_align][INFO] - (same three task reports repeated; all deltas zero)
[2025-08-20 17:42:54,288][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:42:54,289][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:42:54,594 – 17:43:19,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32)
[2025-08-20 17:43:21,356][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:43:22,408][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:43:22,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:43:23,734][__main__][INFO] - Iteration 615 took 54s (37.31% Gen, 62.68% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 41m 43s. Estimated total time: 15h 14m 36s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 18s.
[2025-08-20 17:43:23,735][__main__][INFO] - Starting iteration 615.
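Both optimizer states are rewritten to the same two .pt paths after every iteration: overwrite-in-place checkpointing, so a preempted job loses at most one iteration of optimizer state. The standard PyTorch pattern looks like this (the helper names are illustrative, not mllm's):

```python
import torch

def checkpoint_optimizers(policy_opt, critic_opt, trainer_dir):
    # Overwrite-in-place, matching the fixed paths in the log.
    torch.save(policy_opt.state_dict(), f"{trainer_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{trainer_dir}/critic_optimizer_state.pt")

def resume_optimizers(policy_opt, critic_opt, trainer_dir):
    policy_opt.load_state_dict(torch.load(f"{trainer_dir}/policy_optimizer_state.pt"))
    critic_opt.load_state_dict(torch.load(f"{trainer_dir}/critic_optimizer_state.pt"))
```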
[2025-08-20 17:43:46,882][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:43:46,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:43:46,890][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:43:49,349 – 17:43:49,357][mllm.training.trainer_ad_align][INFO] - (same three task reports repeated; all deltas zero)
[2025-08-20 17:43:49,359][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:43:49,360][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:43:49,659 – 17:44:14,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32)
[2025-08-20 17:44:16,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:44:17,284][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:44:17,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:44:18,606][__main__][INFO] - Iteration 616 took 54s (37.71% Gen, 62.29% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 40m 42s. Estimated total time: 15h 14m 30s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 15s.
[2025-08-20 17:44:18,607][__main__][INFO] - Starting iteration 616.
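The end-of-iteration summaries are internally consistent. At iteration 616, the 10-, 100-, and 500-iteration extrapolations (9m 8s, 1h 31m 27s, 7h 37m 15s) all correspond to a running average of about 54.87 s per iteration, and the estimated total of 15h 14m 30s is exactly 1000 × 54.87 s, so the run appears budgeted for 1000 iterations. (Note also that the completion counter is 1-based while "Starting iteration" is 0-based, which is why each block reports "Iteration N took" followed by "Starting iteration N.") A sketch of the estimator with hypothetical names; remaining time is modeled as estimated total minus elapsed wall-clock, which reproduces the logged figures to within seconds:

```python
def fmt_hms(seconds: float) -> str:
    s = int(seconds)  # truncate, matching the log's rounding
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    return f"{h}h {m}m {sec}s" if h else f"{m}m {sec}s"

def eta_summary(avg_iter_seconds: float, total_iters: int, elapsed: float) -> str:
    total = avg_iter_seconds * total_iters
    extrapolations = ", ".join(
        f"{n} more iterations: {fmt_hms(n * avg_iter_seconds)}"
        for n in (10, 100, 500))
    return (f"Estimated remaining time: {fmt_hms(total - elapsed)}. "
            f"Estimated total time: {fmt_hms(total)}. "
            f"Time estimates for {extrapolations}.")

# eta_summary(54.87, 1000, elapsed=34_428) reproduces the iteration-616 figures.
```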
[2025-08-20 17:44:41,629][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:44:41,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:44:41,636][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:44:44,090 – 17:44:44,098][mllm.training.trainer_ad_align][INFO] - (same three task reports repeated; all deltas zero)
[2025-08-20 17:44:44,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:44:44,100][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:44:44,398 – 17:45:09,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32)
[2025-08-20 17:45:11,201][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:45:12,158][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:45:12,159][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:45:13,457][__main__][INFO] - Iteration 617 took 54s (37.51% Gen, 62.48% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 39m 26s. Estimated total time: 15h 14m 9s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 4s.
[2025-08-20 17:45:13,458][__main__][INFO] - Starting iteration 617.
[2025-08-20 17:45:36,400][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:45:36,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:45:36,407][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:45:38,877 – 17:45:38,885][mllm.training.trainer_ad_align][INFO] - (same three task reports repeated; all deltas zero)
[2025-08-20 17:45:38,887][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:45:38,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:45:39,185 – 17:46:04,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32)
[2025-08-20 17:46:05,873][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:46:06,819][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:46:06,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:46:08,172][__main__][INFO] - Iteration 618 took 54s (37.43% Gen, 62.57% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 36m 16s. Estimated total time: 15h 11m 53s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 11s, 500 more iterations: 7h 35m 56s.
[2025-08-20 17:46:08,173][__main__][INFO] - Starting iteration 618.
[2025-08-20 17:46:31,135][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:46:31,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:46:31,142][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:46:33,607 – 17:46:33,615][mllm.training.trainer_ad_align][INFO] - (same three task reports repeated; all deltas zero)
[2025-08-20 17:46:33,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:46:33,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:46:33,917 – 17:46:59,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32)
[2025-08-20 17:47:00,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:47:01,585][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:47:01,587][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:47:02,811][__main__][INFO] - Iteration 619 took 54s (37.52% Gen, 62.48% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 34m 5s. Estimated total time: 15h 10m 36s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 3s, 500 more iterations: 7h 35m 18s.
[2025-08-20 17:47:02,812][__main__][INFO] - Starting iteration 619.
[2025-08-20 17:47:25,769][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:47:25,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:47:25,776][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:47:28,241 – 17:47:28,249][mllm.training.trainer_ad_align][INFO] - (same three task reports repeated; all deltas zero)
[2025-08-20 17:47:28,251][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:47:28,252][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:47:28,556 – 17:47:53,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32)
[2025-08-20 17:47:55,332][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:47:56,339][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:47:56,341][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:47:57,616][__main__][INFO] - Iteration 620 took 54s (37.41% Gen, 62.58% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 35m 56s. Estimated total time: 15h 13m 23s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 41s.
[2025-08-20 17:47:57,617][__main__][INFO] - Starting iteration 620.
[2025-08-20 17:48:20,966][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:48:20,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:48:20,974][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:48:23,416 – 17:48:23,424][mllm.training.trainer_ad_align][INFO] - (same three task reports repeated; all deltas zero)
[2025-08-20 17:48:23,426][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:48:23,427][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:48:23,725 – 17:48:48,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32)
[2025-08-20 17:48:50,371][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:48:51,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:48:51,336][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:48:52,739][__main__][INFO] - Iteration 621 took 55s (37.94% Gen, 62.06% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 40m 20s. Estimated total time: 15h 18m 41s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 20s.
[2025-08-20 17:48:52,741][__main__][INFO] - Starting iteration 621.
[2025-08-20 17:49:15,828][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:49:15,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:49:15,835][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:49:18,290][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:49:18,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:49:18,298][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:49:18,301][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:49:18,301][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:49:18,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:19,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:20,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:20,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:21,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:22,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:23,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:24,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:24,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:25,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:26,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:27,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:28,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:28,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:29,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:30,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:31,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:32,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:32,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:33,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:34,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:35,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:36,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:37,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:38,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:39,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:39,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:40,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:41,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:42,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:43,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:43,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:49:45,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:49:46,491][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:49:46,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:49:47,770][__main__][INFO] - Iteration 622 took 55s (37.49% Gen, 62.51% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 37m 52s. Estimated total time: 15h 17m 8s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 42s, 500 more iterations: 7h 38m 34s.
[2025-08-20 17:49:47,772][__main__][INFO] - Starting iteration 622.
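Note: each iteration logs 32 "Accumulated the policy gradient loss for 8960 tokens" records (~287k tokens total) followed by a single "Apply reinforce step", which matches a standard gradient-accumulation loop: one backward pass per fixed-size micro-batch, one optimizer step at the end. A sketch under that assumption; the loss form and field names (`action_mask`, `advantages`, etc.) are illustrative, not the actual mllm API:

```python
import torch

def reinforce_step(policy, optimizer, micro_batches, logger):
    """Accumulate a REINFORCE loss over fixed-size micro-batches, then step once."""
    optimizer.zero_grad(set_to_none=True)
    for batch in micro_batches:  # e.g. 32 micro-batches of 8960 tokens each
        logits = policy(batch["input_ids"]).logits          # HF-style model assumed
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, batch["labels"].unsqueeze(-1)).squeeze(-1)
        mask = batch["action_mask"].float()                 # 1 on agent-generated tokens
        # REINFORCE: minimize -E[advantage * log pi(token)], averaged over trained tokens
        loss = -(batch["advantages"] * token_logp * mask).sum() / mask.sum()
        (loss / len(micro_batches)).backward()              # so gradients average, not sum
        logger.info("Accumulated the policy gradient loss for "
                    f"{batch['input_ids'].numel()} tokens.")
    optimizer.step()                                        # the "Apply reinforce step"
```

This also explains the tiny ΔVRAM on "Apply reinforce step": by the time the step runs, activations are freed and only the parameter update remains.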
[2025-08-20 17:50:10,816][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:50:10,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:50:10,824][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:50:13,287][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:50:13,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:50:13,295][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:50:13,298][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:50:13,298][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:50:13,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:14,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:15,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:15,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:16,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:17,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:18,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:19,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:19,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:20,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:21,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:22,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:23,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:23,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:24,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:25,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:26,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:27,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:27,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:28,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:29,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:30,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:31,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:32,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:33,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:33,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:34,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:35,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:36,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:37,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:37,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:38,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:50:40,237][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:50:41,160][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:50:41,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:50:42,557][__main__][INFO] - Iteration 623 took 54s (37.57% Gen, 62.43% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 32m 52s. Estimated total time: 15h 13m 3s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 18s, 500 more iterations: 7h 36m 31s.
[2025-08-20 17:50:42,559][__main__][INFO] - Starting iteration 623.
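Note: the paired "Saved main/critic optimizer state" records after every step show that both optimizer states are checkpointed each iteration, which is what makes the run resumable mid-training. In PyTorch this is typically just `torch.save` on each optimizer's `state_dict`; a minimal sketch (the function name is hypothetical, the filenames match the log):

```python
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
    """Persist both optimizer states each iteration so the run can resume."""
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info(f"Saved main optimizer state to {policy_path}")
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info(f"Saved critic optimizer state to {critic_path}")

# Resuming later:
# policy_opt.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
```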
[2025-08-20 17:51:05,545][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:51:05,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:51:05,553][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:51:08,009][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:51:08,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:51:08,017][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:51:08,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:51:08,020][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:51:08,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:09,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:09,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:10,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:11,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:12,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:13,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:13,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:14,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:15,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:16,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:17,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:17,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:18,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:19,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:20,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:21,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:21,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:22,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:23,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:24,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:25,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:26,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:27,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:27,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:28,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:29,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:30,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:31,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:31,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:32,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:33,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:51:35,072][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:51:36,021][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:51:36,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:51:37,420][__main__][INFO] - Iteration 624 took 54s (37.44% Gen, 62.56% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 33m 14s. Estimated total time: 15h 14m 21s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 26s, 500 more iterations: 7h 37m 10s.
[2025-08-20 17:51:37,421][__main__][INFO] - Starting iteration 624.
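Note: the per-iteration summary combines a Gen/Train wall-clock split with remaining-time extrapolation; the totals drift between iterations (e.g. 15h 13m vs 15h 46m) because they rescale with the latest iteration's duration. A sketch of that arithmetic, assuming a simple last-iteration extrapolation (the real trainer evidently uses fractional durations and possibly a window average, so its numbers differ slightly):

```python
def iteration_summary(it: int, gen_s: float, train_s: float, total_iters: int) -> str:
    """Reproduce the per-iteration timing line from a Gen/Train wall-clock split."""
    it_time = gen_s + train_s
    remaining = (total_iters - it) * it_time  # extrapolate from this iteration only

    def fmt(s: float) -> str:
        h, rem = divmod(int(s), 3600)
        m, sec = divmod(rem, 60)
        return f"{h}h {m}m {sec}s"

    return (f"Iteration {it} took {it_time:.0f}s "
            f"({100 * gen_s / it_time:.2f}% Gen, {100 * train_s / it_time:.2f}% Train). "
            f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s. "
            f"Estimated remaining time: {fmt(remaining)}. "
            f"Time estimates for 10 more iterations: {fmt(10 * it_time)}, "
            f"100 more iterations: {fmt(100 * it_time)}, "
            f"500 more iterations: {fmt(500 * it_time)}.")

# e.g. iteration_summary(624, 20, 34, 1000)
```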
[2025-08-20 17:52:00,935][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:00,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:00,942][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:03,398][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:03,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:03,406][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:03,408][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:52:03,409][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:52:03,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:04,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:05,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:06,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:06,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:07,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:08,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:09,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:10,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:10,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:11,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:12,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:13,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:14,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:14,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:15,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:16,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:17,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:18,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:18,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:20,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:20,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:21,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:22,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:23,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:24,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:24,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:25,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:26,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:27,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:28,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:28,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:52:30,418][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:52:31,381][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:52:31,382][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:52:32,792][__main__][INFO] - Iteration 625 took 55s (38.04% Gen, 61.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 5h 40m 48s. Estimated total time: 15h 22m 49s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 16s, 500 more iterations: 7h 41m 24s.
[2025-08-20 17:52:32,793][__main__][INFO] - Starting iteration 625.
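Note: the "Sharing advantage alignment data" / "Receiving advantage packets" pair suggests the two negotiating agents exchange advantage estimates so each can couple its own policy-gradient terms with the opponent's, which is the core of advantage alignment. The transport is not visible in the log; a deliberately loose sketch using an in-process queue, with all names hypothetical:

```python
from queue import Queue

import torch

def exchange_advantages(my_adv: torch.Tensor,
                        outbox: Queue, inbox: Queue) -> torch.Tensor:
    """Send this agent's advantage estimates and receive the opponent's."""
    outbox.put(my_adv.detach().cpu())   # "Sharing advantage alignment data."
    return inbox.get()                  # "Receiving advantage packets."

# Each trainer can then weight its log-probabilities with a term coupling its
# own advantages to the opponent's; the exact weighting used in this run is
# not recoverable from the log.
```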
[2025-08-20 17:52:56,537][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:56,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:56,545][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:58,998][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:59,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:59,006][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:52:59,008][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:52:59,009][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:52:59,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:00,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:00,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:01,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:02,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:03,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:04,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:04,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:05,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:06,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:07,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:08,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:08,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:09,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:10,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:11,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:12,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:12,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:13,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:14,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:15,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:16,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:17,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:18,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:18,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:19,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:20,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:21,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:21,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:22,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:23,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:24,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:25,933][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:53:26,882][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:53:26,884][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:53:28,161][__main__][INFO] - Iteration 626 took 55s (38.46% Gen, 61.54% Train). Generation: 21s, Training: 34s. Estimated remaining time: 5h 39m 50s. Estimated total time: 15h 22m 47s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 16s, 500 more iterations: 7h 41m 23s.
[2025-08-20 17:53:28,163][__main__][INFO] - Starting iteration 626.
[2025-08-20 17:53:51,188][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:53:51,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:53:51,196][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:53:53,664][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:53:53,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:53:53,672][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:53:53,674][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:53:53,675][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:53:53,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:54,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:55,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:56,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:57,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:57,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:58,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:53:59,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:00,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:01,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:01,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:02,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:03,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:04,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:05,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:05,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:06,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:07,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:08,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:09,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:09,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:11,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:11,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:12,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:13,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:14,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:15,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:15,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:16,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:17,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:18,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:19,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:21,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:54:22,361][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:54:22,363][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:54:24,981][__main__][INFO] - Iteration 627 took 56s (36.19% Gen, 63.81% Train). Generation: 20s, Training: 36s. Estimated remaining time: 6h 3m 3s. Estimated total time: 15h 46m 57s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 41s, 500 more iterations: 7h 53m 28s.
[2025-08-20 17:54:24,982][__main__][INFO] - Starting iteration 627.
[2025-08-20 17:54:48,066][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:54:48,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:54:48,074][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:54:50,517][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:54:50,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:54:50,524][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:54:50,527][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:54:50,527][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:54:50,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:51,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:52,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:53,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:54,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:54,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:55,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:56,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:57,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:57,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:58,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:54:59,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:00,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:01,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:01,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:02,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:03,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:04,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:05,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:05,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:07,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:08,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:08,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:09,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:10,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:11,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:12,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:13,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:14,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:15,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:16,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:17,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:18,863][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:55:19,838][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:55:19,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:55:21,168][__main__][INFO] - Iteration 628 took 56s (36.72% Gen, 63.27% Train). Generation: 20s, Training: 35s. Estimated remaining time: 5h 51m 35s. Estimated total time: 15h 36m 25s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 38s, 500 more iterations: 7h 48m 12s.
[2025-08-20 17:55:21,169][__main__][INFO] - Starting iteration 628.
[2025-08-20 17:55:44,216][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:55:44,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:55:44,224][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:55:46,699][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:55:46,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:55:46,707][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:55:46,709][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:55:46,710][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:55:47,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:47,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:48,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:49,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:50,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:50,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:51,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:52,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:53,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:54,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:54,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:55,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:56,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:57,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:58,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:58,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:55:59,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:00,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:01,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:02,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:03,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:04,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:05,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:05,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:06,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:07,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:08,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:08,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:09,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:10,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:11,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:12,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:13,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:56:14,679][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:56:14,680][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:56:16,003][__main__][INFO] - Iteration 629 took 54s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 28m 8s. Estimated total time: 15h 13m 53s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 23s, 500 more iterations: 7h 36m 56s.
[2025-08-20 17:56:16,004][__main__][INFO] - Starting iteration 629.
[2025-08-20 17:56:39,169][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:56:39,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:56:39,177][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:56:41,632][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:56:41,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:56:41,640][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 17:56:41,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:56:41,643][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:56:41,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:42,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:43,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:44,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:45,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:45,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:46,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:47,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:48,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:49,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:49,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:50,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:51,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:52,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:53,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:53,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:54,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:55,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:56,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:57,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:58,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:59,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:56:59,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:00,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:01,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:02,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:03,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:03,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:04,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:05,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:06,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:07,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:08,583][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 17:57:09,547][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:57:09,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:57:10,869][__main__][INFO] - Iteration 630 took 54s (37.75% Gen, 62.24% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 27m 44s. Estimated total time: 15h 14m 24s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 26s, 500 more iterations: 7h 37m 12s.
[2025-08-20 17:57:10,871][__main__][INFO] - Starting iteration 630.
[2025-08-20 17:57:34,157][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:57:34,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:57:34,165][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:57:36,620][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:57:36,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:57:36,628][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:57:36,630][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:57:36,631][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:57:36,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:37,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:38,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:39,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:40,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:40,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:41,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:42,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:43,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:44,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:44,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:45,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:46,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:47,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:48,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:48,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:49,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:50,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:51,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:52,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:52,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:54,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:54,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:55,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:56,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:57,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:58,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:58,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:57:59,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:00,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:01,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:02,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:03,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 17:58:04,664][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:58:04,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:58:05,941][__main__][INFO] - Iteration 631 took 55s (37.86% Gen, 62.14% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 30m 15s. Estimated total time: 15h 17m 50s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 55s.
[2025-08-20 17:58:05,943][__main__][INFO] - Starting iteration 631.
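Each update above accumulates the policy gradient over 32 micro-batches of 8,960 tokens (32 × 8,960 = 286,720 tokens per reinforce step). A hedged sketch of such a token-counted accumulation loop follows; the batch fields, the HF-style `.logits` output, and the function name are assumptions, and only the logging pattern mirrors the log.

```python
# Hedged sketch of token-counted gradient accumulation; field names assumed.
import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def accumulate_policy_gradient(model, micro_batches, grad_accum_steps: int):
    for batch in micro_batches:
        logits = model(batch["input_ids"],
                       attention_mask=batch["attention_mask"]).logits
        logprobs = torch.log_softmax(logits, dim=-1)
        token_logprobs = logprobs.gather(
            -1, batch["target_ids"].unsqueeze(-1)).squeeze(-1)
        # REINFORCE surrogate: advantage-weighted negative log-likelihood,
        # masked so only generated (action) tokens contribute.
        n_action = batch["loss_mask"].sum().clamp(min=1)
        loss = -(batch["advantages"] * token_logprobs
                 * batch["loss_mask"]).sum() / n_action
        (loss / grad_accum_steps).backward()  # accumulate across micro-batches
        # Here each micro-batch is padded to a fixed size, hence the
        # constant 8960 in the log.
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    batch["input_ids"].numel())
```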
[2025-08-20 17:58:29,176][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:58:29,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:58:29,184][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:58:31,644][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:58:31,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:58:31,652][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:58:31,654][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:58:31,655][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:58:31,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:32,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:33,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:34,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:35,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:35,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:36,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:37,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:38,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:39,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:39,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:40,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:41,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:42,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:43,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:43,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:44,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:45,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:46,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:47,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:47,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:48,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:49,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:50,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:51,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:51,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:52,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:53,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:54,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:55,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:56,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:57,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:58:58,632][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 17:58:59,567][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:58:59,568][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:59:00,867][__main__][INFO] - Iteration 632 took 54s (37.84% Gen, 62.16% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 26m 53s. Estimated total time: 15h 15m 23s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 41s.
[2025-08-20 17:59:00,868][__main__][INFO] - Starting iteration 632.
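The paired "Saved main/critic optimizer state" lines indicate that both optimizers are checkpointed into the run directory after every update. A minimal equivalent built on torch.save; the helper name and signature are assumptions, while the file names come from the log.

```python
# Hedged sketch of per-iteration optimizer-state checkpointing.
import os

import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir: str):
    os.makedirs(out_dir, exist_ok=True)
    # File names as they appear in the log; overwritten each iteration.
    torch.save(policy_optimizer.state_dict(),
               os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_optimizer.state_dict(),
               os.path.join(out_dir, "critic_optimizer_state.pt"))
```

Saving only the latest state to fixed file names keeps disk usage flat while still allowing a crash to resume from the most recent update.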
[2025-08-20 17:59:23,910][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:59:23,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:59:23,918][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:59:26,358][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:59:26,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:59:26,365][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 17:59:26,368][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 17:59:26,368][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 17:59:26,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:27,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:28,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:29,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:29,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:30,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:31,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:32,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:33,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:33,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:34,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:35,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:36,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:36,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:37,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:38,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:39,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:40,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:40,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:41,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:42,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:43,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:44,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:44,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:45,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:46,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:47,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:48,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:49,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:50,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:50,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:51,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 17:59:53,280][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 17:59:54,248][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 17:59:54,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 17:59:55,638][__main__][INFO] - Iteration 633 took 54s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 23m 24s. Estimated total time: 15h 12m 49s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 16s, 500 more iterations: 7h 36m 24s.
[2025-08-20 17:59:55,640][__main__][INFO] - Starting iteration 633.
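The ETA lines are consistent with extrapolating an average iteration time: "10 more iterations: 9m 7s" implies roughly 54.7 s per iteration, in line with the reported 54 s. A sketch follows, assuming a plain mean over completed iterations; the actual estimator may use a moving window.

```python
# Hedged sketch of the remaining-time extrapolation seen in the summaries.
def format_hms(seconds: float) -> str:
    s = int(seconds)
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

def eta_report(iter_durations: list, total_iters: int) -> str:
    avg = sum(iter_durations) / len(iter_durations)
    remaining = total_iters - len(iter_durations)
    extras = ", ".join(f"{n} more iterations: {format_hms(avg * n)}"
                       for n in (10, 100, 500))
    return (f"Estimated remaining time: {format_hms(avg * remaining)}. "
            f"Estimated total time: {format_hms(avg * total_iters)}. "
            f"Time estimates for {extras}.")
```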
[2025-08-20 18:00:18,623][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:00:18,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:00:18,630][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:00:21,099][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:00:21,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:00:21,107][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:00:21,109][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:00:21,110][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:00:21,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:22,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:22,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:23,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:24,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:25,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:26,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:26,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:27,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:28,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:29,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:30,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:30,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:31,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:32,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:33,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:34,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:34,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:36,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:36,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:37,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:38,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:39,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:40,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:40,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:41,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:42,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:43,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:44,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:44,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:45,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:46,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:00:48,102][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:00:49,050][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:00:49,051][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:00:50,304][__main__][INFO] - Iteration 634 took 54s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 20m 44s. Estimated total time: 15h 11m 3s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 6s, 500 more iterations: 7h 35m 31s.
[2025-08-20 18:00:50,306][__main__][INFO] - Starting iteration 634.
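"Sharing advantage alignment data" followed by "Receiving advantage packets" shows the two co-trained agents swapping advantage estimates before each policy update, so that each gradient can be weighted by the opponent's advantages as well as its own. The log does not reveal the packet contents or the transport; the queue pair and field names below are assumptions standing in for whatever mechanism the trainers actually use.

```python
# Hedged sketch of a bidirectional advantage exchange between two trainers.
import queue

class AdvantageChannel:
    """In-process channel pairing two co-trained agent trainers (0 and 1)."""

    def __init__(self):
        self._inbox = {0: queue.Queue(), 1: queue.Queue()}

    def share(self, sender: int, advantages, trajectory_ids):
        # Deliver this trainer's advantages to the *other* trainer's inbox.
        self._inbox[1 - sender].put(
            {"advantages": advantages, "trajectory_ids": trajectory_ids})

    def receive(self, receiver: int):
        # Block until the opponent's packet for this step arrives.
        return self._inbox[receiver].get()
```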
[2025-08-20 18:01:13,475][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:01:13,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:01:13,482][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:01:15,929][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:01:15,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:01:15,937][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:01:15,939][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:01:15,940][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:01:16,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:17,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:17,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:18,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:19,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:20,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:21,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:21,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:22,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:23,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:24,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:24,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:25,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:26,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:27,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:28,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:28,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:29,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:30,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:31,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:32,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:32,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:34,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:34,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:35,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:36,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:37,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:38,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:38,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:39,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:40,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:41,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:01:42,867][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:01:43,805][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:01:43,806][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:01:45,139][__main__][INFO] - Iteration 635 took 54s (37.82% Gen, 62.18% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 22m 38s. Estimated total time: 15h 13m 52s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 23s, 500 more iterations: 7h 36m 56s.
[2025-08-20 18:01:45,140][__main__][INFO] - Starting iteration 635.
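"Apply reinforce step" closes each update: with gradients already accumulated over the 32 micro-batches, the optimizer steps once and gradients are cleared. The near-zero ΔVRAM Allocated (2.29e-05 GB, about 24 KiB) fits a steady state in which the optimizer buffers already exist. A standard PyTorch pattern follows; the gradient-clipping threshold is an assumption.

```python
# Hedged sketch of the optimizer step behind "Apply reinforce step".
import torch

def apply_reinforce_step(model, optimizer, max_grad_norm: float = 1.0):
    # Clip the accumulated gradient, take one update, then clear grads.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```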
[2025-08-20 18:02:08,472][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:02:08,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:02:08,480][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:02:10,956][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:02:10,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:02:10,964][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:02:10,966][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:02:10,966][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:02:11,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:12,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:12,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:13,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:14,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:15,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:16,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:16,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:17,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:18,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:19,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:19,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:20,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:21,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:22,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:23,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:23,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:24,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:25,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:26,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:27,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:28,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:29,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:29,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:30,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:31,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:32,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:33,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:33,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:34,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:35,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:36,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:02:37,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:02:38,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:02:38,883][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:02:40,099][__main__][INFO] - Iteration 636 took 54s (37.97% Gen, 62.03% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 23m 48s. Estimated total time: 15h 15m 58s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 59s.
[2025-08-20 18:02:40,100][__main__][INFO] - Starting iteration 636.
[2025-08-20 18:03:03,181][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:03:03,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:03:03,188][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:03:05,634][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:03:05,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:03:05,642][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:03:05,644][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:03:05,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:03:05,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:06,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:07,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:08,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:09,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:09,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:10,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:11,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:12,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:13,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:13,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:14,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:15,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:16,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:17,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:17,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:18,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:19,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:20,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:21,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:22,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:23,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:23,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:24,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:25,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:26,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:26,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:27,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:28,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:29,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:30,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:30,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:03:32,555][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:03:33,516][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:03:33,517][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:03:34,860][__main__][INFO] - Iteration 637 took 54s (37.68% Gen, 62.32% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 19m 35s. Estimated total time: 15h 12m 39s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 19s.
[2025-08-20 18:03:34,861][__main__][INFO] - Starting iteration 637.
[2025-08-20 18:03:57,980][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:03:57,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:03:57,988][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:00,456][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:00,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:00,463][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:00,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:04:00,466][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:04:00,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:01,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:02,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:03,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:03,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:04,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:05,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:06,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:07,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:07,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:08,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:09,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:10,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:11,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:11,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:12,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:13,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:14,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:15,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:15,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:17,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:17,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:18,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:19,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:20,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:21,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:21,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:22,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:23,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:24,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:25,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:25,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:27,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:04:28,454][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:04:28,456][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:04:29,845][__main__][INFO] - Iteration 638 took 54s (37.58% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 22m 24s. Estimated total time: 15h 16m 23s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 11s.
[2025-08-20 18:04:29,847][__main__][INFO] - Starting iteration 638.
[2025-08-20 18:04:52,968][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:52,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:52,975][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:55,411][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:55,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:55,419][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:04:55,422][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:04:55,422][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:04:55,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:56,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:57,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:58,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:58,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:04:59,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:00,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:01,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:02,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:02,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:03,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:04,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:05,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:06,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:06,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:07,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:08,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:09,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:10,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:10,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:12,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:12,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:13,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:14,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:15,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:16,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:16,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:17,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:18,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:19,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:20,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:20,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:05:22,448][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:05:23,774][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:05:23,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:05:25,658][__main__][INFO] - Iteration 639 took 55s (37.04% Gen, 62.96% Train). Generation: 20s, Training: 35s. Estimated remaining time: 5h 35m 16s. Estimated total time: 15h 30m 10s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 5s.
[2025-08-20 18:05:25,660][__main__][INFO] - Starting iteration 639.
[2025-08-20 18:05:48,755][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:05:48,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:05:48,762][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:05:51,226][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:05:51,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:05:51,234][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:05:51,236][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:05:51,237][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:05:51,535 … 18:06:16,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, ~0.8 s apart)
[2025-08-20 18:06:18,164][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:06:19,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:06:19,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:06:20,623][__main__][INFO] - Iteration 640 took 54s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 20m 13s. Estimated total time: 15h 16m 2s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 1s.
[2025-08-20 18:06:20,624][__main__][INFO] - Starting iteration 640.
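Each iteration shows exactly 32 "Accumulated the policy gradient loss for 8960 tokens." records followed by a single "Apply reinforce step": the classic gradient-accumulation pattern, where micro-batch losses are backpropagated one at a time and the optimizer steps once at the end. A minimal sketch under those assumptions (the batch fields, helper names, and the clipping step are illustrative, not the actual mllm code):

import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def reinforce_step(model, optimizer, micro_batches, max_grad_norm=1.0):
    """One accumulated policy-gradient update (sketch; batch layout is assumed)."""
    optimizer.zero_grad(set_to_none=True)
    n = len(micro_batches)  # 32 in this run, 8960 action tokens each
    for mb in micro_batches:
        # mb.log_probs() and mb.advantages stand in for the real batch fields:
        # per-token log pi(a_t | s_t) under the current policy, and the
        # (alignment-adjusted) advantages exchanged earlier in the iteration.
        logp = mb.log_probs()
        loss = -(mb.advantages * logp).sum() / mb.num_tokens
        (loss / n).backward()  # scale so the accumulated gradient is an average
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    mb.num_tokens)
    # Gradient clipping is assumed here; the log does not show it.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()  # logged as "Apply reinforce step"

Accumulating over 32 fixed-size micro-batches keeps peak activation memory to one micro-batch while the effective batch is 32 × 8960 tokens, which matches the near-zero ΔVRAM once the run reaches steady state.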
[2025-08-20 18:06:43,861][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:06:43,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:06:43,869][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:06:46,367][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:06:46,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:06:46,375][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:06:46,377][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:06:46,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:06:46,679 … 18:07:11,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, ~0.8 s apart)
[2025-08-20 18:07:13,340][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:07:14,287][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:07:14,289][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:07:15,593][__main__][INFO] - Iteration 641 took 54s (37.75% Gen, 62.25% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 19m 23s. Estimated total time: 15h 16m 8s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 4s.
[2025-08-20 18:07:15,595][__main__][INFO] - Starting iteration 641.
[2025-08-20 18:07:39,106][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:07:39,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:07:39,114][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:07:41,586][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:07:41,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:07:41,594][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:07:41,596][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:07:41,597][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:07:41,894 … 18:08:06,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, ~0.8 s apart)
[2025-08-20 18:08:08,601][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:08:10,191][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:08:10,193][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:08:11,534][__main__][INFO] - Iteration 642 took 55s (37.60% Gen, 62.39% Train). Generation: 21s, Training: 34s. Estimated remaining time: 5h 34m 38s. Estimated total time: 15h 32m 18s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 13s, 500 more iterations: 7h 46m 9s.
[2025-08-20 18:08:11,535][__main__][INFO] - Starting iteration 642.
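The "Sharing advantage alignment data." / "Receiving advantage packets." pair indicates that each trainer exchanges per-step advantage estimates with its counterpart before the policy update, since advantage alignment weights the policy gradient by both agents' advantages. The transport is not visible in the log; a minimal sketch assuming a pair of inter-process queues (all names here are hypothetical):

import multiprocessing as mp

def exchange_advantages(own_advantages, send_q, recv_q):
    """Swap per-step advantage tensors with the opponent's trainer.

    Purely illustrative: the log only shows that data is shared and packets
    are received, not how. A queue pair created with mp.Queue() is one simple
    option when the two trainers run as separate processes.
    """
    send_q.put(own_advantages.detach().cpu())  # "Sharing advantage alignment data."
    opponent_advantages = recv_q.get()         # "Receiving advantage packets."
    return opponent_advantages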
[2025-08-20 18:08:34,771][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:08:34,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:08:34,779][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:08:37,233][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:08:37,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:08:37,241][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:08:37,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:08:37,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:08:37,542 … 18:09:02,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, ~0.8 s apart)
[2025-08-20 18:09:04,261][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:09:05,217][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:09:05,218][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:09:06,635][__main__][INFO] - Iteration 643 took 55s (37.70% Gen, 62.30% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 19m 43s. Estimated total time: 15h 18m 19s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 9s.
[2025-08-20 18:09:06,637][__main__][INFO] - Starting iteration 643.
[2025-08-20 18:09:30,210][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:09:30,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:09:30,218][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:09:32,675][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:09:32,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:09:32,683][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:09:32,685][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:09:32,686][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:09:32,986 … 18:09:58,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, ~0.8 s apart)
[2025-08-20 18:09:59,676][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:10:00,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:10:00,610][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:10:01,960][__main__][INFO] - Iteration 644 took 55s (38.18% Gen, 61.82% Train). Generation: 21s, Training: 34s. Estimated remaining time: 5h 22m 32s. Estimated total time: 15h 22m 2s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 12s, 500 more iterations: 7h 41m 1s.
[2025-08-20 18:10:01,961][__main__][INFO] - Starting iteration 644.
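Both optimizer states are rewritten to the same files after every update, so a killed or preempted job can resume with optimizer momentum intact rather than from model weights alone. A sketch of the standard PyTorch pattern (only the file names are taken from the log; the function name is illustrative):

import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir):
    """Checkpoint both optimizers after each update, as the log above shows."""
    torch.save(policy_optimizer.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_optimizer.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

# Resuming after preemption (sketch):
# policy_optimizer.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
# critic_optimizer.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))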
[2025-08-20 18:10:25,121][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:10:25,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:10:25,128][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:10:27,605][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:10:27,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:10:27,613][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:10:27,615][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:10:27,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:10:27,924 … 18:10:52,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, ~0.8 s apart)
[2025-08-20 18:10:54,543][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:10:55,563][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:10:55,566][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:10:57,036][__main__][INFO] - Iteration 645 took 55s (37.56% Gen, 62.44% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 17m 28s. Estimated total time: 15h 17m 54s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 47s, 500 more iterations: 7h 38m 57s.
[2025-08-20 18:10:57,038][__main__][INFO] - Starting iteration 645.
[2025-08-20 18:11:20,786][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:11:20,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:11:20,793][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:11:23,254][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:11:23,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:11:23,262][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:11:23,264][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:11:23,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:11:23,564 … 18:11:48,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, ~0.8 s apart)
[2025-08-20 18:11:50,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:11:51,209][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:11:51,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:11:52,530][__main__][INFO] - Iteration 646 took 55s (38.36% Gen, 61.63% Train). Generation: 21s, Training: 34s. Estimated remaining time: 5h 23m 29s. Estimated total time: 15h 24m 51s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 29s, 500 more iterations: 7h 42m 25s.
[2025-08-20 18:11:52,531][__main__][INFO] - Starting iteration 646.
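Each iteration computes advantages twice, once for the sampled batch and once for the alternative trajectory batch ("Compute alternative advantage estimates"), both against the learned critic whose value loss is accumulated in the same pass. The log does not show the estimator itself; one common choice compatible with a critic is Generalized Advantage Estimation, sketched here under that assumption (gamma, lam, and the shapes are illustrative):

import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory (sketch).

    rewards: tensor of shape [T]; values: tensor of shape [T+1], where the
    final entry is the bootstrap value for the state after the last step.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages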
[2025-08-20 18:12:15,623][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:12:15,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:12:15,631][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:12:18,086][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:12:18,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:12:18,093][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:12:18,096][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:12:18,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:12:18,394 … 18:12:43,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, ~0.8 s apart)
[2025-08-20 18:12:45,121][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:12:46,057][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:12:46,059][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:12:47,375][__main__][INFO] - Iteration 647 took 54s (37.65% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 11m 47s. Estimated total time: 15h 14m 3s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 1s.
[2025-08-20 18:12:47,377][__main__][INFO] - Starting iteration 647.
[2025-08-20 18:13:13,155][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:13:13,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:13:13,163][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:13:15,634][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:13:15,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:13:15,642][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:13:15,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:13:15,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:13:15,946 … 18:13:41,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 records, ~0.8 s apart)
[2025-08-20 18:13:43,123][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:13:44,089][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:13:44,091][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:13:45,548][__main__][INFO] - Iteration 648 took 58s (40.09% Gen, 59.91% Train). Generation: 23s, Training: 34s. Estimated remaining time: 6h 6m 16s. Estimated total time: 16h 9m 30s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 57s, 500 more iterations: 8h 4m 45s.
[2025-08-20 18:13:45,550][__main__][INFO] - Starting iteration 648.
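The footer's projections track the most recent iteration closely: after the slow iteration 648 (58 s, with generation up to 23 s) the 500-iteration estimate jumps to 8h 4m 45s, about 500 × 58.2 s, and falls back once the next iteration finishes in 54 s. A minimal sketch of such an estimator, assuming it simply scales a recent per-iteration duration (the real code may smooth over several iterations; function names are illustrative):

def fmt_hms(seconds):
    """Format seconds in the log's "5h 11m 26s" style."""
    s = round(seconds)
    return f"{s // 3600}h {s % 3600 // 60}m {s % 60}s"

def eta_report(iter_seconds, iters_done, iters_total):
    """Extrapolate remaining run time from a recent per-iteration duration."""
    remaining = (iters_total - iters_done) * iter_seconds
    print(f"Estimated remaining time: {fmt_hms(remaining)}. "
          f"Time estimates for 10 more iterations: {fmt_hms(10 * iter_seconds)}, "
          f"100 more iterations: {fmt_hms(100 * iter_seconds)}, "
          f"500 more iterations: {fmt_hms(500 * iter_seconds)}.")

# Example: eta_report(58.2, 649, 1000) reproduces estimates of the same
# magnitude as the iteration-648 footer above.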
[2025-08-20 18:14:08,799][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:14:08,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:14:08,806][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:14:11,283][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:14:11,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:14:11,291][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:14:11,293][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:14:11,294][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:14:11,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:12,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:13,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:13,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:14,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:15,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:16,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:17,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:17,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:18,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:19,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:20,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:21,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:21,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:22,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:23,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:24,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:25,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:25,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:26,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:27,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:28,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:29,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:30,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:31,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:31,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:32,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:33,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:34,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:35,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:35,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:36,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:14:38,297][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:14:39,263][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:14:39,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:14:40,486][__main__][INFO] - Iteration 649 took 54s (37.84% Gen, 62.16% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 11m 26s. Estimated total time: 15h 15m 36s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 48s.
[2025-08-20 18:14:40,487][__main__][INFO] - Starting iteration 649.
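Each iteration logs 32 "Accumulated the policy gradient loss for 8960 tokens." entries before a single "Apply reinforce step", the classic gradient-accumulation pattern: the REINFORCE surrogate is backpropagated per microbatch and the optimizer step is deferred until every microbatch has contributed. A minimal sketch under assumed tensor names (the batch fields "input_ids" and "action_mask" and the per-token "advantages" layout are illustrative, not the real mllm schema):

import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def accumulate_policy_gradient_loss(model, batch, advantages, num_microbatches):
    # Backprop the REINFORCE surrogate for one microbatch; the step is deferred.
    logits = model(batch["input_ids"]).logits[:, :-1]       # predict token t+1 from prefix t
    logp = torch.log_softmax(logits, dim=-1)
    targets = batch["input_ids"][:, 1:]
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = batch["action_mask"][:, 1:].float()              # 1 on agent-generated tokens
    adv = advantages[:, 1:]                                 # per-token advantage estimates
    loss = -(adv * token_logp * mask).sum() / mask.sum()    # REINFORCE: -A_t * log pi(a_t|s_t)
    (loss / num_microbatches).backward()                    # grads accumulate across calls
    logger.info("Accumulated the policy gradient loss for %d tokens.",
                int(batch["action_mask"].sum()))            # microbatch token count
    return loss.detach()

The constant 8960-token count per entry suggests fixed-size (padded) microbatches, which keeps the per-microbatch backward time, roughly 0.8 s in the timestamps above, nearly uniform.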
[2025-08-20 18:15:03,595][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:15:03,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:15:03,602][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:15:06,034][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:15:06,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:15:06,042][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:15:06,044][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:15:06,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:15:06,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:07,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:07,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:08,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:09,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:10,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:11,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:11,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:12,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:13,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:14,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:15,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:15,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:16,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:17,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:18,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:19,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:19,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:21,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:21,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:22,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:23,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:24,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:25,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:25,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:26,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:27,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:28,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:28,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:29,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:30,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:31,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:15:32,917][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:15:33,924][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:15:33,926][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:15:35,272][__main__][INFO] - Iteration 650 took 54s (37.73% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 8m 0s. Estimated total time: 15h 13m 4s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 18s, 500 more iterations: 7h 36m 32s.
[2025-08-20 18:15:35,274][__main__][INFO] - Starting iteration 650.
[2025-08-20 18:15:58,756][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:15:58,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:15:58,763][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:01,243][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:01,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:01,251][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:01,253][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:16:01,253][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:16:01,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:02,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:03,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:03,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:04,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:05,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:06,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:07,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:07,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:08,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:09,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:10,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:11,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:11,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:12,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:13,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:14,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:15,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:16,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:17,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:17,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:18,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:19,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:20,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:21,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:21,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:22,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:23,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:24,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:25,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:25,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:26,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:16:28,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:16:29,277][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:16:29,279][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:16:33,231][__main__][INFO] - Iteration 651 took 57s (36.27% Gen, 59.15% Train). Generation: 21s, Training: 34s. Estimated remaining time: 5h 59m 54s. Estimated total time: 16h 5m 56s. Time estimates for 10 more iterations: 9m 39s, 100 more iterations: 1h 36m 35s, 500 more iterations: 8h 2m 58s.
[2025-08-20 18:16:33,232][__main__][INFO] - Starting iteration 651.
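The "Apply reinforce step" entry (note the tiny ΔVRAM: only optimizer bookkeeping, no new activations) is followed every iteration by re-saving both optimizer states, so a preempted job can resume without losing its Adam moments. A sketch of that step, assuming standard torch.optim objects; the function name and the gradient-clipping default are hypothetical, not confirmed by the log:

import os

import torch

def apply_reinforce_step(model, policy_optimizer, critic_optimizer, save_dir,
                         max_grad_norm=1.0):
    # Apply the step deferred during gradient accumulation (clipping is an
    # assumption; the log only shows that a single step follows 32 microbatches).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    policy_optimizer.step()
    policy_optimizer.zero_grad(set_to_none=True)

    # Re-save optimizer states each iteration, matching the two
    # "Saved ... optimizer state" lines above.
    os.makedirs(save_dir, exist_ok=True)
    torch.save(policy_optimizer.state_dict(),
               os.path.join(save_dir, "policy_optimizer_state.pt"))
    torch.save(critic_optimizer.state_dict(),
               os.path.join(save_dir, "critic_optimizer_state.pt"))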
[2025-08-20 18:16:56,474][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:56,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:56,482][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:58,960][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:58,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:58,968][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:16:58,970][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:16:58,971][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:16:59,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:00,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:00,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:01,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:02,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:03,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:04,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:04,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:05,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:06,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:07,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:07,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:08,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:09,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:10,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:11,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:11,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:12,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:13,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:14,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:15,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:15,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:17,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:17,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:18,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:19,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:20,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:21,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:21,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:22,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:23,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:24,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:25,904][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:17:26,833][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:17:26,834][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:17:28,203][__main__][INFO] - Iteration 652 took 54s (37.79% Gen, 62.21% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 9m 13s. Estimated total time: 15h 16m 10s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 5s.
[2025-08-20 18:17:28,205][__main__][INFO] - Starting iteration 652.
[2025-08-20 18:17:51,455][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:17:51,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:17:51,463][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:17:53,921][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:17:53,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:17:53,929][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:17:53,931][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:17:53,932][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:17:54,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:55,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:55,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:56,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:57,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:58,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:58,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:17:59,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:00,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:01,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:02,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:02,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:03,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:04,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:05,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:06,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:06,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:07,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:08,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:09,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:10,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:11,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:12,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:13,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:13,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:14,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:15,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:16,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:17,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:17,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:18,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:19,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:20,960][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:18:21,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:18:21,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:18:23,333][__main__][INFO] - Iteration 653 took 55s (37.71% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 10m 55s. Estimated total time: 15h 18m 47s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 23s.
[2025-08-20 18:18:23,334][__main__][INFO] - Starting iteration 653.
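The per-iteration summary lines are plain running-average arithmetic: the "10/100/500 more iterations" figures divide back to the same mean duration (9m 11s over 10 iterations is about 55.1 s per iteration, matching the 55 s just logged), and the remaining time is that mean times the iterations left. A sketch of the arithmetic, assuming a simple mean over recent iterations; the real estimator may window or weight the samples differently:

from datetime import timedelta

def eta_report(durations, iterations_done, total_iterations):
    # Extrapolate remaining/total time from observed per-iteration durations.
    mean = sum(durations, timedelta()) / len(durations)
    remaining = (total_iterations - iterations_done) * mean
    total = iterations_done * mean + remaining  # assumes uniform speed so far
    horizons = {n: n * mean for n in (10, 100, 500)}
    return mean, remaining, total, horizons

# Example: a 55.1 s mean gives 10 -> ~9m 11s, 100 -> ~1h 31m 50s,
# 500 -> ~7h 39m 10s, in line with the figures logged above.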
[2025-08-20 18:18:46,590][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:18:46,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:18:46,597][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:18:49,058][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:18:49,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:18:49,065][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:18:49,067][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:18:49,068][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:18:49,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:50,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:50,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:51,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:52,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:53,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:54,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:54,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:55,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:56,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:57,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:58,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:58,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:18:59,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:00,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:01,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:02,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:02,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:03,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:04,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:05,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:06,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:06,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:07,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:08,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:09,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:10,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:11,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:12,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:12,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:13,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:14,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:16,110][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:19:17,075][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:19:17,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:19:18,321][__main__][INFO] - Iteration 654 took 54s (37.81% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 7m 38s. Estimated total time: 15h 16m 26s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 13s.
[2025-08-20 18:19:18,322][__main__][INFO] - Starting iteration 654.
[2025-08-20 18:19:41,522][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:19:41,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:19:41,530][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:19:43,984][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:19:43,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:19:43,992][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:19:43,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:19:43,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:19:44,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:45,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:45,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:46,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:47,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:48,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:49,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:49,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:50,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:51,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:52,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:53,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:53,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:54,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:55,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:56,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:57,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:58,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:59,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:19:59,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:00,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:01,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:02,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:02,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:03,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:04,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:05,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:06,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:06,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:07,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:08,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:09,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:10,931][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:20:11,884][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:20:11,886][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:20:13,254][__main__][INFO] - Iteration 655 took 54s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 5m 48s. Estimated total time: 15h 15m 31s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 45s.
[2025-08-20 18:20:13,260][__main__][INFO] - Starting iteration 655.
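The paired "Sharing advantage alignment data." / "Receiving advantage packets." lines indicate that the co-trained adapters exchange per-trajectory advantage estimates before the policy loss is accumulated, so each player's surrogate can be weighted by the other player's advantages as well as its own. The log shows neither the transport nor the packet layout, so the sketch below is purely schematic: the dataclass fields and the queue-based exchange are assumptions, not the mllm implementation.

import queue
from dataclasses import dataclass

import torch

@dataclass
class AdvantagePacket:
    # Schematic per-trajectory advantage estimates exchanged between agents.
    trajectory_id: int
    advantages: torch.Tensor   # one estimate per action token

def exchange_advantages(own_packets, outbox: queue.Queue, inbox: queue.Queue):
    # "Sharing advantage alignment data."
    for pkt in own_packets:
        outbox.put(pkt)
    # "Receiving advantage packets." -- blocks until the peer has shared.
    received = [inbox.get() for _ in own_packets]
    # The received (opponent) advantages can then enter the policy-gradient
    # weighting alongside the agent's own advantages.
    return received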
[2025-08-20 18:20:36,873][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:20:36,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:20:36,888][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:20:39,356][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:20:39,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:20:39,364][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:20:39,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:20:39,367][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:20:39,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:40,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:41,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:42,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:42,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:43,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:44,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:45,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:46,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:46,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:47,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:48,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:49,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:50,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:50,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:51,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:52,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:53,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:53,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:55,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:56,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:56,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:57,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:58,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:20:59,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:00,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:00,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:01,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:02,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:03,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:04,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:04,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:06,389][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:21:07,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:21:07,322][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:21:08,589][__main__][INFO] - Iteration 656 took 55s (38.20% Gen, 61.80% Train). Generation: 21s, Training: 34s. Estimated remaining time: 5h 11m 24s. Estimated total time: 15h 22m 2s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 12s, 500 more iterations: 7h 41m 1s.
[2025-08-20 18:21:08,591][__main__][INFO] - Starting iteration 656.
[2025-08-20 18:21:31,825][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:21:31,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:21:31,832][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:21:34,257][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:21:34,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:21:34,264][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:21:34,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:21:34,267][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:21:34,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:35,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:36,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:36,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:37,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:38,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:39,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:40,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:40,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:41,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:42,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:43,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:44,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:44,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:45,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:46,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:47,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:48,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:48,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:49,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:50,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:51,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:52,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:53,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:54,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:54,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:55,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:56,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:57,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:58,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:58,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:21:59,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:01,183][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:22:02,079][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:22:02,080][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:22:03,368][__main__][INFO] - Iteration 657 took 54s (37.96% Gen, 62.04% Train). Generation: 20s, Training: 33s. Estimated remaining time: 5h 1m 25s. Estimated total time: 15h 12m 57s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 17s, 500 more iterations: 7h 36m 28s.
[2025-08-20 18:22:03,370][__main__][INFO] - Starting iteration 657.
[2025-08-20 18:22:26,612][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:22:26,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:22:26,620][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:22:29,073][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:22:29,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:22:29,081][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:22:29,083][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:22:29,084][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:22:29,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:30,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:30,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:31,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:32,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:33,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:34,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:34,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:35,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:36,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:37,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:38,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:38,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:39,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:40,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:41,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:22:42,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:42,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:43,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:44,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:45,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:46,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:47,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:48,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:48,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:49,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:50,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:51,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:52,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:52,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:53,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:54,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:22:56,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:22:57,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:22:57,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:22:58,404][__main__][INFO] - Iteration 658 took 55s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 4m 46s. Estimated total time: 15h 17m 13s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 36s. [2025-08-20 18:22:58,405][__main__][INFO] - Starting iteration 658. 
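Every iteration emits exactly 32 "Accumulated the policy gradient loss for 8960 tokens" entries before a single "Apply reinforce step", which is the classic gradient-accumulation pattern: one backward pass per fixed-size microbatch, one optimizer update per iteration. A minimal sketch of the pattern under that assumption (the names optimizer, microbatches, and the tuple layout are placeholders inferred from the log, not mllm's API):

    def apply_reinforce_step(optimizer, microbatches, logger):
        # One update per iteration, with gradients accumulated over all
        # microbatches (32 microbatches of 8960 tokens each in this run).
        optimizer.zero_grad(set_to_none=True)
        for log_probs, advantages, num_tokens in microbatches:
            # REINFORCE: raise log-probs of sampled tokens in proportion
            # to their advantage estimates, normalized per token.
            loss = -(log_probs * advantages).sum() / num_tokens
            (loss / len(microbatches)).backward()  # scale so grads average
            logger.info("Accumulated the policy gradient loss for %d tokens.", num_tokens)
        optimizer.step()  # the "Apply reinforce step" entry in the log

The near-zero ΔVRAM on the reinforce step is consistent with this: by the time the step runs, gradient buffers already exist, so the update allocates almost nothing new.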
[2025-08-20 18:23:21,634][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:23:21,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:23:21,641][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:23:24,093][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:23:24,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:23:24,101][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:23:24,103][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:23:24,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:23:24,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:25,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:25,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:26,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:27,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:28,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:29,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:29,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:30,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:31,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:32,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:33,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:33,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:34,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:35,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:36,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:37,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:37,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:38,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:39,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:40,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:41,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:41,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:42,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:43,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:44,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:45,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:46,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:47,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:47,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:48,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:49,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:23:51,110][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:23:52,064][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:23:52,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:23:53,333][__main__][INFO] - Iteration 659 took 54s (37.84% Gen, 62.16% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 2m 4s. Estimated total time: 15h 15m 27s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 43s.
[2025-08-20 18:23:53,334][__main__][INFO] - Starting iteration 659.
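After every update the trainer checkpoints both optimizer states under the run directory, so a preempted job can resume without losing the optimizer's moment estimates. In PyTorch this is typically just torch.save of state_dict(); a sketch under that assumption (variable names assumed, the run directory shortened to out_dir):

    import torch

    def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir, logger):
        # Persist optimizer state so training can resume exactly where it left off.
        policy_path = f"{out_dir}/policy_optimizer_state.pt"
        critic_path = f"{out_dir}/critic_optimizer_state.pt"
        torch.save(policy_optimizer.state_dict(), policy_path)
        logger.info("Saved main optimizer state to %s", policy_path)
        torch.save(critic_optimizer.state_dict(), critic_path)
        logger.info("Saved critic optimizer state to %s", critic_path)

    # On resume, the saved states would be restored with:
    #   policy_optimizer.load_state_dict(torch.load(policy_path))
    #   critic_optimizer.load_state_dict(torch.load(critic_path))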
[2025-08-20 18:24:16,771][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:24:16,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:24:16,779][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:24:19,291][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:24:19,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:24:19,298][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:24:19,301][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:24:19,302][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:24:19,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:20,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:21,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:21,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:22,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:23,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:24,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:25,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:25,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:26,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:27,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:28,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:29,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:29,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:30,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:31,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:32,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:33,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:33,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:34,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:35,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:36,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:37,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:38,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:39,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:39,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:40,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:41,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:42,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:43,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:43,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:44,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:24:46,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:24:47,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:24:47,137][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:24:48,441][__main__][INFO] - Iteration 660 took 55s (38.01% Gen, 61.98% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 4m 9s. Estimated total time: 15h 18m 26s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 50s, 500 more iterations: 7h 39m 13s.
[2025-08-20 18:24:48,443][__main__][INFO] - Starting iteration 660.
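The end-of-iteration summaries are simple arithmetic over a running average of iteration durations: at roughly 55 s per iteration, the "500 more iterations" figure of about 7h 39m is just 500 x 55 s, and the remaining-time estimate scales the same average by the iterations left. A sketch of that bookkeeping (total_iterations and the exact message layout are assumptions):

    def iteration_summary(i, durations, gen_s, train_s, total_iterations):
        # Running mean duration in seconds over the iterations seen so far.
        avg = sum(durations) / len(durations)
        remaining = avg * (total_iterations - i)

        def hms(seconds):
            h, rem = divmod(int(seconds), 3600)
            m, s = divmod(rem, 60)
            return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

        gen_pct = 100.0 * gen_s / (gen_s + train_s)
        return (
            f"Iteration {i} took {int(durations[-1])}s "
            f"({gen_pct:.2f}% Gen, {100.0 - gen_pct:.2f}% Train). "
            f"Generation: {gen_s}s, Training: {train_s}s. "
            f"Estimated remaining time: {hms(remaining)}. "
            f"Time estimates for 10 more iterations: {hms(10 * avg)}, "
            f"100 more iterations: {hms(100 * avg)}, "
            f"500 more iterations: {hms(500 * avg)}."
        )

The small wobble in the totals from one iteration to the next (15h 12m to 15h 27m) is expected: each new 54-55 s sample nudges the running average slightly.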
[2025-08-20 18:25:12,294][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:25:12,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:25:12,302][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:25:14,767][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:25:14,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:25:14,776][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:25:14,778][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:25:14,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:25:15,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:15,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:16,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:17,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:18,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:19,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:19,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:20,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:21,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:22,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:23,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:23,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:24,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:25,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:26,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:26,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:27,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:28,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:29,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:30,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:30,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:31,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:32,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:33,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:34,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:35,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:36,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:37,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:37,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:38,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:39,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:40,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:25:41,915][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:25:42,856][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:25:42,858][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:25:44,116][__main__][INFO] - Iteration 661 took 55s (38.40% Gen, 61.60% Train). Generation: 21s, Training: 34s. Estimated remaining time: 5h 12m 39s. Estimated total time: 15h 27m 52s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 56s.
[2025-08-20 18:25:44,117][__main__][INFO] - Starting iteration 661.
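The "Sharing advantage alignment data" / "Receiving advantage packets" pair is where the two agents' trainers exchange advantage estimates. In advantage-alignment-style opponent shaping, each agent weights its policy gradient not only by its own advantage but also by a term coupling it to the co-player's advantages, which is why each trainer also builds alternative trajectory batches and alternative advantage estimates beforehand. A deliberately simplified sketch of the coupled weighting (beta and the product form are illustrative; this compresses the actual objective, which the log does not spell out):

    import torch

    def aligned_policy_gradient_loss(log_probs, own_adv, opp_adv, beta=1.0):
        # own_adv: this agent's per-step advantage estimates.
        # opp_adv: advantages received from the co-player's trainer
        #          (the "advantage packets" in the log).
        # The coupling term rewards actions that were advantageous for
        # both players, making cooperation in the iterated prisoner's
        # dilemma (the "ipd" in the run name) learnable.
        weights = own_adv + beta * own_adv * opp_adv
        return -(log_probs * weights.detach()).sum()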
[2025-08-20 18:26:07,385][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:26:07,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:26:07,393][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:26:09,815][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:26:09,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:26:09,823][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:26:09,825][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:26:09,825][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:26:10,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:10,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:11,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:12,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:13,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:14,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:14,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:15,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:16,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:17,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:18,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:18,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:19,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:20,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:21,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:22,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:22,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:23,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:24,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:25,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:26,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:27,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:27,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:28,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:29,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:30,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:31,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:31,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:32,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:33,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:34,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:35,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:26:36,727][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:26:37,734][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:26:37,736][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:26:39,079][__main__][INFO] - Iteration 662 took 54s (37.88% Gen, 62.12% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 59m 52s. Estimated total time: 15h 16m 0s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 0s.
[2025-08-20 18:26:39,080][__main__][INFO] - Starting iteration 662.
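Each ~55 s iteration splits into a generation phase (~20 s of rollouts with the current policy, the "Gen" share of the summary) and a training phase (~34 s covering the advantage computations, the 32 accumulation steps, the update, and checkpointing). The outer loop presumably looks something like the following sketch; every function name here is a placeholder, not mllm's API:

    import time

    def run(trainer, env, start_iteration, total_iterations, logger):
        durations, gen_times, train_times = [], [], []
        for i in range(start_iteration, total_iterations):
            logger.info("Starting iteration %d.", i)

            t0 = time.monotonic()
            trajectories = env.generate_episodes(trainer.policy)  # "Generation"
            t1 = time.monotonic()
            trainer.train_on(trajectories)                        # "Training"
            t2 = time.monotonic()

            gen_times.append(t1 - t0)
            train_times.append(t2 - t1)
            durations.append(t2 - t0)
            # Emit the summary line with the Gen/Train split and the ETA
            # figures (see the iteration_summary sketch earlier).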
[2025-08-20 18:27:02,236][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:02,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:02,243][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:04,691][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:04,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:04,699][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:04,701][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:27:04,701][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:27:05,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:05,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:06,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:07,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:08,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:08,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:09,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:10,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:11,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:12,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:12,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:13,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:14,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:15,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:16,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:16,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:17,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:18,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:19,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:20,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:21,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:22,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:23,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:23,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:24,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:25,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:26,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:26,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:27,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:28,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:29,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:30,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:27:31,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:27:32,700][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:27:32,702][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:27:34,170][__main__][INFO] - Iteration 663 took 55s (37.57% Gen, 62.43% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 1m 6s. Estimated total time: 15h 18m 9s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 4s.
[2025-08-20 18:27:34,171][__main__][INFO] - Starting iteration 663.
[2025-08-20 18:27:57,422][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:57,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:57,430][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:59,898][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:59,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:59,906][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:27:59,909][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:27:59,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:28:00,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:01,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:01,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:02,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:03,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:04,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:04,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:05,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:06,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:07,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:08,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:08,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:09,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:10,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:11,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:12,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:12,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:13,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:14,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:15,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:16,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:16,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:17,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:18,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:19,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:20,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:21,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:22,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:22,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:23,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:24,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:25,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:26,962][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:28:28,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:28:28,408][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:28:29,653][__main__][INFO] - Iteration 664 took 55s (37.47% Gen, 62.53% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 6m 38s. Estimated total time: 15h 24m 37s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 18s.
[2025-08-20 18:28:29,655][__main__][INFO] - Starting iteration 664.
[2025-08-20 18:28:53,024][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:28:53,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:28:53,031][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:28:55,465][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:28:55,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:28:55,473][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:28:55,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:28:55,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:28:55,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:56,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:57,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:58,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:58,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:28:59,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:00,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:01,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:02,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:02,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:03,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:04,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:05,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:06,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:06,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:07,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:09,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:09,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:10,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:11,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:12,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:13,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:14,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:15,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:15,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:16,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:17,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:18,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:19,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:19,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:20,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:21,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:22,958][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:29:23,910][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:29:23,912][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:29:25,295][__main__][INFO] - Iteration 665 took 55s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 8m 25s. Estimated total time: 15h 27m 19s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 39s.
[2025-08-20 18:29:25,297][__main__][INFO] - Starting iteration 665.
[2025-08-20 18:29:48,877][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:29:48,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:29:48,885][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:29:51,346][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:29:51,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:29:51,354][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:29:51,356][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:29:51,356][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:29:51,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:52,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:53,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:54,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:54,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:55,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:56,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:57,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:57,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:58,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:29:59,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:00,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:01,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:01,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:02,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:03,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:04,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:05,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:05,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:06,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:07,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:08,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:09,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:10,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:11,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:12,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:12,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:13,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:14,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:15,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:16,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:16,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:18,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:30:19,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:30:19,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:30:20,688][__main__][INFO] - Iteration 666 took 55s (38.15% Gen, 61.85% Train). Generation: 21s, Training: 34s. Estimated remaining time: 5h 3m 21s. Estimated total time: 15h 23m 11s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 35s.
[2025-08-20 18:30:20,693][__main__][INFO] - Starting iteration 666.
[2025-08-20 18:30:43,977][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:30:43,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:30:43,985][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:30:46,421][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:30:46,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:30:46,430][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:30:46,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:30:46,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:30:46,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:47,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:48,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:49,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:49,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:50,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:51,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:52,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:53,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:53,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:54,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:55,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:56,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:57,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:57,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:58,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:30:59,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:00,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:00,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:01,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:03,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:03,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:04,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:05,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:06,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:06,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:07,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:08,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:09,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:10,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:10,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:11,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:13,322][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:31:14,246][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:31:14,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:31:15,574][__main__][INFO] - Iteration 667 took 54s (37.96% Gen, 62.04% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 53m 56s. Estimated total time: 15h 14m 40s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 20s.
[2025-08-20 18:31:15,575][__main__][INFO] - Starting iteration 667.
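Editor's note: each iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." entries before a single "Apply reinforce step", i.e. 32 × 8960 ≈ 287k tokens contribute to one optimizer update. A self-contained sketch of that accumulate-then-step pattern follows; the tensors and module are stand-ins (hypothetical names, random data), not the project's policy.

```python
import torch

torch.manual_seed(0)
vocab, feat_dim = 8, 16
policy_head = torch.nn.Linear(feat_dim, vocab)   # stand-in for the LLM policy
optimizer = torch.optim.Adam(policy_head.parameters(), lr=1e-5)

num_micro_batches, tokens_per_micro_batch = 32, 8960
optimizer.zero_grad(set_to_none=True)
for _ in range(num_micro_batches):
    feats = torch.randn(tokens_per_micro_batch, feat_dim)       # stand-in activations
    actions = torch.randint(vocab, (tokens_per_micro_batch,))   # sampled tokens
    advantages = torch.randn(tokens_per_micro_batch)            # per-token advantages
    logp = torch.log_softmax(policy_head(feats), dim=-1)
    logp_taken = logp.gather(1, actions[:, None]).squeeze(1)    # log-prob of each token
    pg_loss = -(advantages * logp_taken).mean()                 # REINFORCE surrogate
    (pg_loss / num_micro_batches).backward()                    # accumulate into .grad
    print(f"Accumulated the policy gradient loss for {tokens_per_micro_batch} tokens.")
optimizer.step()  # corresponds to the "Apply reinforce step" task above
```

Dividing each micro-batch loss by the number of micro-batches makes the accumulated gradient match what one large batch would have produced.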
[2025-08-20 18:31:39,155][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:31:39,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:31:39,162][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:31:41,631][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:31:41,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:31:41,639][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:31:41,641][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:31:41,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:31:41,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:42,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:43,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:44,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:45,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:45,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:46,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:47,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:48,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:49,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:49,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:50,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:52,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:53,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:54,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:54,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:55,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:56,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:57,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:57,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:58,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:31:59,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:00,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:01,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:02,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:03,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:04,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:04,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:05,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:06,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:07,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:07,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:09,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:32:10,691][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:32:10,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:32:12,096][__main__][INFO] - Iteration 668 took 56s (37.35% Gen, 62.65% Train). Generation: 21s, Training: 35s. Estimated remaining time: 5h 20m 19s. Estimated total time: 15h 42m 0s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 12s, 500 more iterations: 7h 51m 0s.
[2025-08-20 18:32:12,098][__main__][INFO] - Starting iteration 668.
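Editor's note: after every reinforce step the trainer persists both optimizer states under seed_1000/agent_trainer/, which is what lets a preempted run resume mid-training. A minimal sketch of that checkpointing, assuming standard torch.save of state_dicts (the helper name and exact format are assumptions):

```python
import logging

import torch

def save_optimizer_states(policy_optimizer: torch.optim.Optimizer,
                          critic_optimizer: torch.optim.Optimizer,
                          out_dir: str,
                          logger: logging.Logger) -> None:
    """Hypothetical sketch: persist both optimizer states after each update."""
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)
    # On restart: optimizer.load_state_dict(torch.load(policy_path))
```

Saving optimizer state matters for Adam-style optimizers, whose per-parameter moment estimates would otherwise reset on resume and perturb training.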
[2025-08-20 18:32:35,323][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:32:35,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:32:35,331][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:32:37,787][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:32:37,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:32:37,795][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:32:37,797][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:32:37,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:32:38,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:38,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:39,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:40,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:41,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:42,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:42,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:43,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:44,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:45,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:46,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:46,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:47,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:48,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:49,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:50,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:50,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:51,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:52,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:53,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:53,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:54,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:55,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:56,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:57,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:57,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:32:59,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:00,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:00,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:01,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:02,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:03,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:04,800][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:33:05,791][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:33:05,793][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:33:07,489][__main__][INFO] - Iteration 669 took 55s (37.55% Gen, 62.45% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 0m 33s. Estimated total time: 15h 23m 10s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 35s.
[2025-08-20 18:33:07,490][__main__][INFO] - Starting iteration 669.
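Editor's note: the per-iteration summaries are simple scalings of a running-average iteration time. Here, "10 more iterations: 9m 13s" implies an average of about 55.3s per iteration, and the "5h 0m 33s" remaining estimate (18033s) at that rate implies roughly 325 iterations left. A small sketch of that bookkeeping, with hypothetical names (the real estimator may weight recent iterations differently):

```python
def fmt(seconds: float) -> str:
    """Format a duration the way this log does, e.g. 18033 -> '5h 0m 33s'."""
    s = int(seconds)
    return f"{s // 3600}h {s % 3600 // 60}m {s % 60}s"

iteration_seconds = []  # wall-clock duration of each completed iteration

def report_progress(total_iterations: int, gen_s: float, train_s: float) -> None:
    """Print one progress line from generation/training times of the last iteration."""
    it_s = gen_s + train_s
    iteration_seconds.append(it_s)
    avg = sum(iteration_seconds) / len(iteration_seconds)
    remaining = total_iterations - len(iteration_seconds)
    print(
        f"Iteration {len(iteration_seconds)} took {fmt(it_s)} "
        f"({100 * gen_s / it_s:.2f}% Gen, {100 * train_s / it_s:.2f}% Train). "
        f"Estimated remaining time: {fmt(avg * remaining)}. "
        f"Estimated total time: {fmt(avg * total_iterations)}. "
        f"Time estimates for 10 more iterations: {fmt(avg * 10)}, "
        f"100 more iterations: {fmt(avg * 100)}, "
        f"500 more iterations: {fmt(avg * 500)}."
    )
```

The small oscillations in the logged totals (15h 14m to 15h 44m across adjacent iterations) are consistent with the estimate being re-scaled from slightly noisy 54-56s iteration times.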
[2025-08-20 18:33:30,740][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:33:30,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:33:30,747][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:33:33,184][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:33:33,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:33:33,192][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:33:33,194][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:33:33,195][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:33:33,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:34,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:35,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:35,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:36,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:37,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:38,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:39,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:39,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:40,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:41,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:42,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:43,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:43,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:44,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:45,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:46,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:46,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:47,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:48,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:49,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:50,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:51,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:52,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:52,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:53,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:54,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:55,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:56,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:56,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:57,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:33:58,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:00,100][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:34:01,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:34:01,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:34:02,409][__main__][INFO] - Iteration 670 took 54s (37.89% Gen, 62.11% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 51m 47s. Estimated total time: 15h 15m 18s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 39s.
[2025-08-20 18:34:02,411][__main__][INFO] - Starting iteration 670.
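Editor's note: "Sharing advantage alignment data" / "Receiving advantage packets" indicates the two agent trainers exchange per-token advantage estimates so each policy update can condition on the co-player's advantages, in the spirit of advantage-alignment opponent shaping. The surrogate below is one published form of the alignment term (own log-probs weighted by accumulated own advantages times the opponent's current advantage); it is an assumption for illustration, not necessarily the term this codebase optimizes.

```python
import torch

def advantage_alignment_term(logp: torch.Tensor,
                             own_adv: torch.Tensor,
                             opp_adv: torch.Tensor,
                             beta: float = 1.0) -> torch.Tensor:
    """Assumed alignment surrogate: weight each step's log-prob by
    (sum of own strictly-past advantages) x (opponent's current advantage),
    reinforcing actions that coincide with good outcomes for the co-player."""
    past_own = torch.cumsum(own_adv, dim=0) - own_adv   # sum over k < t
    weights = (past_own * opp_adv).detach()             # treated as constants
    return -beta * (weights * logp).mean()

# A per-micro-batch loss would then combine the plain REINFORCE term with it:
#   loss = -(own_adv.detach() * logp).mean() + advantage_alignment_term(logp, own_adv, opp_adv)
```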
[2025-08-20 18:34:26,495][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:34:26,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:34:26,502][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:34:28,958][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:34:28,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:34:28,965][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:34:28,968][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:34:28,968][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:34:29,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:30,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:30,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:31,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:32,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:33,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:34,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:34,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:35,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:36,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:37,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:37,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:38,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:39,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:40,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:41,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:41,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:42,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:44,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:44,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:45,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:46,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:47,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:48,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:48,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:49,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:50,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:51,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:52,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:52,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:53,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:54,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:34:56,011][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:34:56,986][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:34:56,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:34:59,058][__main__][INFO] - Iteration 671 took 56s (38.17% Gen, 61.83% Train). Generation: 21s, Training: 35s. Estimated remaining time: 5h 19m 38s. Estimated total time: 15h 44m 6s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 24s, 500 more iterations: 7h 52m 3s.
[2025-08-20 18:34:59,059][__main__][INFO] - Starting iteration 671.
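Editor's note: "Get advantages with critic gradient accumulation" suggests the critic is regressed toward returns while producing the advantage estimates consumed by the policy loss above. A standard GAE-style estimator is sketched below under that assumption; the codebase's actual estimator is not visible in the log, so names and hyperparameters here are illustrative.

```python
import torch

def gae_advantages(rewards: torch.Tensor,
                   values: torch.Tensor,
                   gamma: float = 0.99,
                   lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one trajectory.
    Shapes: rewards [T], values [T+1] (critic value for each state plus bootstrap)."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                  # exponential trace
        adv[t] = running
    return adv

# Critic regression target: returns = advantages + values[:-1]; its MSE loss can be
# accumulated across micro-batches exactly like the policy-gradient loss above.
```

The paired "Create alternative trajectory batch" / "Compute alternative advantage estimates" tasks then presumably re-run this estimation on counterfactual trajectories; the log does not show how those batches are constructed.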
[2025-08-20 18:35:22,697][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:35:22,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:35:22,705][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:35:25,136][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:35:25,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:35:25,143][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:35:25,145][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:35:25,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:35:25,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:26,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:27,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:27,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:28,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:29,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:30,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:30,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:31,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:32,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:33,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:34,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:34,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:35,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:36,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:37,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:38,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:38,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:39,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:40,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:41,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:42,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:42,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:44,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:44,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:45,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:46,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:47,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:48,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:48,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:49,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:50,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:35:52,144][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:35:53,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:35:53,119][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:35:54,467][__main__][INFO] - Iteration 672 took 55s (38.26% Gen, 61.74% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 58m 3s. Estimated total time: 15h 23m 27s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 20s, 500 more iterations: 7h 41m 43s.
[2025-08-20 18:35:54,469][__main__][INFO] - Starting iteration 672.
[2025-08-20 18:36:18,059][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:36:18,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:36:18,067][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:36:20,521][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:36:20,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:36:20,528][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:36:20,531][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:36:20,531][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:36:20,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:21,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:22,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:23,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:23,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:24,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:25,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:26,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:27,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:27,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:28,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:29,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:30,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:31,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:31,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:32,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:34,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:34,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:35,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:36,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:37,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:37,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:38,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:39,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:40,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:41,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:41,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:42,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:43,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:44,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:45,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:45,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:36:47,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:36:48,478][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:36:48,480][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:36:49,838][__main__][INFO] - Iteration 673 took 55s (38.19% Gen, 61.81% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 56m 30s. Estimated total time: 15h 22m 49s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 16s, 500 more iterations: 7h 41m 24s.
[2025-08-20 18:36:49,840][__main__][INFO] - Starting iteration 673.
[2025-08-20 18:37:13,156][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:37:13,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:37:13,164][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:37:15,613][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:37:15,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:37:15,621][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:37:15,623][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:37:15,624][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:37:15,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:16,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:17,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:18,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:19,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:19,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:20,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:21,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:22,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:23,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:23,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:24,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:25,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:26,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:27,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:27,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:28,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:29,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:30,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:31,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:31,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:32,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:33,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:34,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:35,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:36,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:37,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:37,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:38,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:39,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:40,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:41,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:37:42,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:37:43,629][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:37:43,631][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:37:45,133][__main__][INFO] - Iteration 674 took 55s (37.67% Gen, 62.33% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 54m 18s. Estimated total time: 15h 21m 32s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 9s, 500 more iterations: 7h 40m 46s.
[2025-08-20 18:37:45,135][__main__][INFO] - Starting iteration 674.
[2025-08-20 18:38:08,517][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:38:08,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:38:08,524][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:38:10,990][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:38:10,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:38:10,997][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:38:10,999][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:38:11,000][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:38:11,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:12,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:12,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:13,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:14,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:15,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:16,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:16,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:17,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:18,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:19,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:20,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:20,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:21,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:22,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:23,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:24,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:24,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:25,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:26,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:27,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:28,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:29,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:29,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:30,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:31,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:32,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:33,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:33,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:34,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:35,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:36,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 18:38:37,969][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:38:38,899][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:38:38,901][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:38:40,612][__main__][INFO] - Iteration 675 took 55s (37.71% Gen, 62.29% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 56m 27s. Estimated total time: 15h 24m 37s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 18s.
[2025-08-20 18:38:40,614][__main__][INFO] - Starting iteration 675.
[2025-08-20 18:39:04,206][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:39:04,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:39:04,214][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:39:06,681][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:39:06,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:39:06,689][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:39:06,691][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 18:39:06,691][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 18:39:06,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:07,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:08,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:09,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:10,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:10,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:11,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:12,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:13,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:14,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:14,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:15,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:16,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:17,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:18,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:18,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 18:39:19,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:20,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:21,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:22,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:23,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:24,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:24,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:25,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:26,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:27,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:28,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:28,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:29,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:30,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:31,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:32,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:39:33,750][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:39:34,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:39:34,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:39:36,101][__main__][INFO] - Iteration 676 took 55s (38.12% Gen, 61.88% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 55m 41s. Estimated total time: 15h 24m 46s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 28s, 500 more iterations: 7h 42m 23s. [2025-08-20 18:39:36,102][__main__][INFO] - Starting iteration 676. 
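[Editor's note] Between the advantage phase and the optimizer step, each iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens" records about 0.8 s apart, followed by one "Apply reinforce step": classic gradient accumulation, where 32 fixed-size microbatches of 8960 action tokens share a single update. A minimal sketch of that pattern, assuming a REINFORCE-style token-level loss weighted by per-token advantages; policy.token_log_probs and the batch fields are hypothetical stand-ins for the repository's real API:

import logging

logger = logging.getLogger("mllm.training.trainer_common")

def reinforce_update(policy, optimizer, microbatches, tokens_per_batch=8960):
    """Accumulate the policy-gradient loss over all microbatches, then step once."""
    optimizer.zero_grad(set_to_none=True)
    scale = 1.0 / (tokens_per_batch * len(microbatches))
    for batch in microbatches:
        # Log-probabilities of the generated (action) tokens under the current policy.
        logprobs = policy.token_log_probs(batch["input_ids"], batch["action_mask"])
        # REINFORCE: push up the log-probability of tokens with positive advantage.
        loss = -(batch["advantages"] * logprobs).sum() * scale
        loss.backward()  # .grad buffers accumulate across the 32 microbatches
        logger.info("Accumulated the policy gradient loss for %d tokens.", tokens_per_batch)
    optimizer.step()  # the single "Apply reinforce step" per iteration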
[2025-08-20 18:39:59,651][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:39:59,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:39:59,658][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:02,109][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:02,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:02,116][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:02,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 18:40:02,119][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 18:40:02,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:03,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:04,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:04,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:05,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:06,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:07,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:07,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:08,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:09,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:10,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:11,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:11,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:12,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:13,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:14,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 18:40:15,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:15,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:16,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:17,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:18,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:19,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:20,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:21,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:21,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:22,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:23,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:24,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:25,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:25,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:26,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:27,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:29,096][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:40:30,021][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:40:30,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:40:31,361][__main__][INFO] - Iteration 677 took 55s (38.17% Gen, 61.83% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 50m 57s. Estimated total time: 15h 20m 58s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 5s, 500 more iterations: 7h 40m 29s. [2025-08-20 18:40:31,362][__main__][INFO] - Starting iteration 677. 
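[Editor's note] "Get advantages with critic gradient accumulation" suggests the critic_adapter is trained in the same pass that produces the advantage estimates: the value head's regression loss is backpropagated while the advantages themselves are detached for the policy update. A sketch under that assumption, using a one-step TD target; the log does not reveal the actual estimator (GAE or Monte Carlo returns would produce the same record):

import torch
import torch.nn.functional as F

def advantages_with_critic_grad(critic, states, rewards, gamma=0.99):
    values = critic(states)                          # V(s_t), shape [T]
    next_values = torch.cat([values[1:], values.new_zeros(1)])
    targets = rewards + gamma * next_values          # one-step TD targets
    advantages = (targets - values).detach()         # consumed by the policy loss
    # Accumulate the critic's regression gradient; its optimizer steps later.
    F.mse_loss(values, targets.detach()).backward()
    return advantages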
[2025-08-20 18:40:55,032][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:55,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:55,040][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:57,495][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:57,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:57,503][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:40:57,505][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 18:40:57,506][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 18:40:57,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:58,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:40:59,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:00,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:00,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:01,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:02,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:03,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:04,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:04,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:05,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:06,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:07,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:08,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:08,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:09,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 18:41:10,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:11,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:12,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:12,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:14,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:14,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:15,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:16,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:17,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:18,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:18,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:19,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:20,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:21,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:22,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:22,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:24,535][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:41:25,488][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:41:25,489][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:41:26,753][__main__][INFO] - Iteration 678 took 55s (38.34% Gen, 61.66% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 52m 14s. Estimated total time: 15h 23m 10s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 35s. [2025-08-20 18:41:26,754][__main__][INFO] - Starting iteration 678. 
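[Editor's note] The "Sharing advantage alignment data" / "Receiving advantage packets" pair indicates that the two per-agent trainers swap advantage estimates before updating; advantage-alignment objectives shape each agent's policy gradient with the opponent's advantages as well as its own. The transport is not visible in the log; the sketch below assumes a pair of multiprocessing queues, and AdvantagePacket is a hypothetical container:

from dataclasses import dataclass

import torch

@dataclass
class AdvantagePacket:
    agent: str
    iteration: int
    advantages: torch.Tensor  # one entry per time step of the shared episode

def exchange_advantages(mine, send_queue, recv_queue, timeout=60.0):
    # Publish our advantages, then block until the opponent's packet arrives,
    # so both sides compute the aligned objective from the same iteration.
    # send_queue/recv_queue: e.g. two mp.Queue objects created at launch,
    # one per direction.
    send_queue.put(mine)
    theirs = recv_queue.get(timeout=timeout)
    assert theirs.iteration == mine.iteration, "trainers are out of step"
    return theirs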
[2025-08-20 18:41:49,934][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:41:49,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:41:49,942][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:41:52,403][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:41:52,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:41:52,411][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:41:52,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 18:41:52,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 18:41:52,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:53,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:54,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:55,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:55,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:56,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:57,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:58,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:59,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:41:59,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:00,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:01,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:02,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:03,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:03,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:04,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 18:42:05,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:06,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:06,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:08,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:09,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:09,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:10,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:11,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:12,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:13,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:13,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:14,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:15,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:16,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:17,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:17,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:19,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:42:20,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:42:20,531][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:42:21,815][__main__][INFO] - Iteration 679 took 55s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 45m 49s. Estimated total time: 15h 17m 40s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 50s. [2025-08-20 18:42:21,816][__main__][INFO] - Starting iteration 679. 
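[Editor's note] "Create alternative trajectory batch" runs ahead of "Compute alternative advantage estimates", and the triplet appears twice per iteration (plausibly once per agent), consistent with scoring counterfactual variants of the sampled trajectories. For a two-action game like the iterated prisoner's dilemma this run is named after (ipd_…), one plausible, purely illustrative mechanism is to flip each sampled decision:

import copy

IPD_ACTIONS = ("cooperate", "defect")  # assumption based on the run name "ipd_…"

def alternative_trajectory_batch(batch):
    # Swap in, at every step, the action the agent did not take, so the critic
    # can score the counterfactual trajectory alongside the sampled one.
    alt = copy.deepcopy(batch)
    alt["actions"] = [IPD_ACTIONS[1 - IPD_ACTIONS.index(a)] for a in batch["actions"]]
    return alt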
[2025-08-20 18:42:45,038][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:42:45,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:42:45,046][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:42:47,498][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:42:47,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:42:47,506][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:42:47,508][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 18:42:47,508][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 18:42:47,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:48,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:49,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:50,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:50,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:51,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:52,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:53,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:54,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:54,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:55,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:56,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:57,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:58,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:58,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:42:59,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 18:43:00,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:01,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:02,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:03,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:04,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:04,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:05,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:06,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:07,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:08,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:08,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:09,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:10,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:11,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:12,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:12,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:14,461][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:43:15,383][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:43:15,384][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:43:16,734][__main__][INFO] - Iteration 680 took 54s (37.84% Gen, 62.16% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 42m 31s. Estimated total time: 15h 15m 17s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 38s. [2025-08-20 18:43:16,735][__main__][INFO] - Starting iteration 680. 
[2025-08-20 18:43:40,224][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:43:40,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:43:40,232][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:43:42,684][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:43:42,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:43:42,692][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:43:42,694][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 18:43:42,695][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 18:43:42,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:43,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:44,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:45,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:46,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:46,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:47,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:48,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:49,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:50,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:50,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:51,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:52,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:53,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:54,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:54,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 18:43:55,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:56,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:57,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:58,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:43:59,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:00,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:00,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:01,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:03,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:04,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:05,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:06,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:07,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:07,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:08,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:09,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:11,060][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:44:11,973][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:44:11,975][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:44:13,254][__main__][INFO] - Iteration 681 took 56s (37.26% Gen, 62.74% Train). Generation: 21s, Training: 35s. Estimated remaining time: 5h 8m 15s. Estimated total time: 15h 41m 58s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 11s, 500 more iterations: 7h 50m 59s. [2025-08-20 18:44:13,255][__main__][INFO] - Starting iteration 681. 
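[Editor's note] After every step both optimizer states are rewritten to the same two files under seed_1000/agent_trainer/, so a preempted job can resume with its Adam moments intact rather than re-warming them from zero. A minimal save/restore sketch using standard torch.save and torch.load on state_dicts; save_dir stands in for the scratch path in the log:

from pathlib import Path

import torch

def save_optimizer_states(policy_opt, critic_opt, save_dir: Path):
    save_dir.mkdir(parents=True, exist_ok=True)
    # state_dicts keep checkpoints loadable even if surrounding code changes.
    torch.save(policy_opt.state_dict(), save_dir / "policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), save_dir / "critic_optimizer_state.pt")

def load_optimizer_states(policy_opt, critic_opt, save_dir: Path):
    policy_opt.load_state_dict(torch.load(save_dir / "policy_optimizer_state.pt"))
    critic_opt.load_state_dict(torch.load(save_dir / "critic_optimizer_state.pt"))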
[2025-08-20 18:44:36,846][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:44:36,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:44:36,853][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:44:39,290][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:44:39,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:44:39,297][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:44:39,299][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 18:44:39,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 18:44:39,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:40,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:41,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:41,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:42,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:43,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:44,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:45,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:45,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:46,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:47,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:48,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:49,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:49,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:50,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:51,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 18:44:52,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:53,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:54,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:55,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:55,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:56,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:57,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:58,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:59,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:44:59,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:00,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:01,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:02,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:03,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:03,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:04,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:06,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:45:07,313][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:45:07,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:45:08,760][__main__][INFO] - Iteration 682 took 55s (38.12% Gen, 61.88% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 50m 26s. Estimated total time: 15h 25m 4s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 30s, 500 more iterations: 7h 42m 32s. [2025-08-20 18:45:08,761][__main__][INFO] - Starting iteration 682. 
[2025-08-20 18:45:32,000][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:45:32,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:45:32,008][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:45:34,467][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:45:34,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:45:34,475][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:45:34,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 18:45:34,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 18:45:34,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:35,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:36,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:37,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:37,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:38,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:39,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:40,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:41,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:41,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:42,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:43,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:44,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:45,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:45,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:46,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 18:45:47,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:48,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:49,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:50,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:51,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:51,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:52,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:53,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:54,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:55,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:55,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:56,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:57,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:58,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:45:59,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:00,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:01,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:46:02,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:46:02,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:46:04,233][__main__][INFO] - Iteration 683 took 55s (37.47% Gen, 62.53% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 48m 58s. Estimated total time: 15h 24m 31s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 15s. [2025-08-20 18:46:04,234][__main__][INFO] - Starting iteration 683. 
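[Editor's note] The closing summary of each iteration is plain bookkeeping over measured wall-clock times: the Gen/Train percentages are generation seconds over total seconds, computed before rounding (which is why a printed 20 s / 55 s split can show 37.47% Gen), and the remaining-time figures multiply a per-iteration average out to the end of the schedule; the numbers here are consistent with a run of roughly 1,000 iterations. A sketch of that arithmetic (the iteration total and the averaging window are assumptions):

def hms(seconds: float) -> str:
    s = round(seconds)
    return f"{s // 3600}h {s % 3600 // 60}m {s % 60}s"

def iteration_summary(i: int, gen_s: float, train_s: float,
                      avg_iter_s: float, total_iters: int = 1000) -> str:
    total_s = gen_s + train_s
    return (f"Iteration {i} took {round(total_s)}s "
            f"({100 * gen_s / total_s:.2f}% Gen, {100 * train_s / total_s:.2f}% Train). "
            f"Generation: {round(gen_s)}s, Training: {round(train_s)}s. "
            f"Estimated remaining time: {hms((total_iters - i) * avg_iter_s)}. "
            f"Estimated total time: {hms(total_iters * avg_iter_s)}.")

# e.g. iteration_summary(683, 20.5, 34.2, 55.2) prints figures close to
# the iteration 683 record above.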
[2025-08-20 18:46:27,334][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:46:27,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:46:27,341][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:46:29,819][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:46:29,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:46:29,827][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 18:46:29,829][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 18:46:29,829][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 18:46:30,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:30,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:31,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:32,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:33,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:34,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:34,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:35,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:36,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:37,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:38,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:38,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:39,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:40,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:41,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:42,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 18:46:42,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:43,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:44,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:45,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:45,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:46,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:48,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:48,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:49,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:50,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:51,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:52,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:52,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:53,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:54,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:55,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 18:46:56,911][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 18:46:57,832][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 18:46:57,833][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 18:46:59,114][__main__][INFO] - Iteration 684 took 54s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 38m 11s. Estimated total time: 15h 14m 39s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 19s. [2025-08-20 18:46:59,115][__main__][INFO] - Starting iteration 684. 
[2025-08-20 18:47:22,312][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:47:22,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:47:22,319][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:47:24,770][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:47:24,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:47:24,778][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:47:24,779][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:47:24,780][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:47:25,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[last message repeated 31 more times, through 2025-08-20 18:47:50,682]
[2025-08-20 18:47:52,293][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:47:53,218][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:47:53,219][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:47:54,584][__main__][INFO] - Iteration 685 took 55s (37.41% Gen, 62.59% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 47m 4s. Estimated total time: 15h 24m 27s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 26s, 500 more iterations: 7h 42m 13s.
[2025-08-20 18:47:54,585][__main__][INFO] - Starting iteration 685.
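The paired "Sharing advantage alignment data." / "Receiving advantage packets." lines indicate that, before each policy update, the two agent trainers swap advantage estimates: advantage alignment roughly weights one agent's policy gradient by the other agent's advantages, so each trainer needs its opponent's values. The transport is not visible in the log; the sketch below is a stand-in using multiprocessing queues, with AdvantagePacket, send_q, and recv_q as hypothetical names.

from dataclasses import dataclass

import torch
import torch.multiprocessing as mp

@dataclass
class AdvantagePacket:
    agent_id: str
    advantages: torch.Tensor  # per-step advantage estimates, e.g. shape [batch, turns]

def exchange_advantages(own, send_q, recv_q):
    # Hypothetical exchange between the two agent trainers.
    send_q.put(own)       # "Sharing advantage alignment data."
    other = recv_q.get()  # "Receiving advantage packets." (blocks until the peer sends)
    return other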
[2025-08-20 18:48:18,063][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:48:18,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:48:18,070][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:48:20,500][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:48:20,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:48:20,508][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:48:20,510][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:48:20,510][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:48:20,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[last message repeated 31 more times, through 2025-08-20 18:48:45,987]
[2025-08-20 18:48:47,600][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:48:48,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:48:48,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:48:49,768][__main__][INFO] - Iteration 686 took 55s (38.14% Gen, 61.86% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 41m 23s. Estimated total time: 15h 19m 42s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 51s.
[2025-08-20 18:48:49,769][__main__][INFO] - Starting iteration 686.
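Each update then accumulates the policy-gradient loss over 32 micro-batches of 8960 tokens (286,720 tokens per step) before the single "Apply reinforce step". A sketch of that accumulation pattern follows, assuming a token-level REINFORCE surrogate; the loss normalization and names are illustrative, not mllm's exact implementation.

def apply_reinforce_step(optimizer, micro_batches):
    # micro_batches: list of (logprobs, advantages, mask) tensors, 8960 tokens each.
    n = len(micro_batches)
    for logprobs, advantages, mask in micro_batches:
        # REINFORCE surrogate: maximize E[A * log pi], i.e. minimize its negative.
        loss = -(advantages * logprobs * mask).sum() / mask.sum()
        (loss / n).backward()  # gradients accumulate across micro-batches
        # -> "Accumulated the policy gradient loss for 8960 tokens."
    optimizer.step()           # -> "For task: Apply reinforce step ..."
    optimizer.zero_grad()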
[2025-08-20 18:49:12,869][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:49:12,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:49:12,876][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:49:15,327][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:49:15,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:49:15,335][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:49:15,337][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:49:15,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:49:15,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[last message repeated 31 more times, through 2025-08-20 18:49:41,586]
[2025-08-20 18:49:43,248][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:49:44,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:49:44,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:49:45,517][__main__][INFO] - Iteration 687 took 55s (37.03% Gen, 62.96% Train). Generation: 20s, Training: 35s. Estimated remaining time: 4h 49m 52s. Estimated total time: 15h 29m 7s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 54s, 500 more iterations: 7h 44m 33s.
[2025-08-20 18:49:45,518][__main__][INFO] - Starting iteration 687.
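The two "Saved ... optimizer state" lines after every step are what make the run resumable from the scratch directory. This is the standard PyTorch state_dict round-trip; the helper names below are hypothetical.

import torch

def save_optimizer_state(optimizer, path):
    torch.save(optimizer.state_dict(), path)  # e.g. .../agent_trainer/policy_optimizer_state.pt

def restore_optimizer_state(optimizer, path):
    optimizer.load_state_dict(torch.load(path, map_location="cpu"))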
[2025-08-20 18:50:08,681][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:50:08,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:50:08,688][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:50:11,137][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:50:11,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:50:11,144][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:50:11,147][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:50:11,147][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:50:11,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[last message repeated 31 more times, through 2025-08-20 18:50:36,571]
[2025-08-20 18:50:38,195][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:50:39,098][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:50:39,099][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:50:40,461][__main__][INFO] - Iteration 688 took 54s (37.70% Gen, 62.30% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 35m 32s. Estimated total time: 15h 15m 42s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 34s, 500 more iterations: 7h 37m 51s.
[2025-08-20 18:50:40,462][__main__][INFO] - Starting iteration 688.
[2025-08-20 18:51:03,520][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:51:03,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:51:03,528][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:51:05,981][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:51:05,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:51:05,988][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:51:05,990][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:51:05,991][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:51:06,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[last message repeated 31 more times, through 2025-08-20 18:51:31,570]
[2025-08-20 18:51:33,215][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:51:34,195][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:51:34,196][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:51:35,434][__main__][INFO] - Iteration 689 took 54s (37.50% Gen, 62.50% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 35m 6s. Estimated total time: 15h 16m 10s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 5s.
[2025-08-20 18:51:35,435][__main__][INFO] - Starting iteration 689.
[2025-08-20 18:51:58,567][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:51:58,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:51:58,574][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:01,017][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:01,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:01,025][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:01,027][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:52:01,027][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:52:01,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[last message repeated 31 more times, through 2025-08-20 18:52:26,449]
[2025-08-20 18:52:28,060][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:52:28,987][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:52:28,989][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:52:30,296][__main__][INFO] - Iteration 690 took 54s (37.73% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 32m 21s. Estimated total time: 15h 14m 20s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 26s, 500 more iterations: 7h 37m 10s.
[2025-08-20 18:52:30,297][__main__][INFO] - Starting iteration 690.
[2025-08-20 18:52:53,486][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:53,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:53,494][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:55,951][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:55,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:55,958][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:52:55,960][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:52:55,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:52:56,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[last message repeated 31 more times, through 2025-08-20 18:53:21,417]
[2025-08-20 18:53:23,075][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:53:24,016][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:53:24,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:53:28,158][__main__][INFO] - Iteration 691 took 57s (35.86% Gen, 64.14% Train). Generation: 20s, Training: 37s. Estimated remaining time: 5h 21m 23s. Estimated total time: 16h 4m 20s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 26s, 500 more iterations: 8h 2m 10s.
[2025-08-20 18:53:28,159][__main__][INFO] - Starting iteration 691.
[2025-08-20 18:53:51,470][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:53:51,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:53:51,478][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:53:53,947][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:53:53,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:53:53,954][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:53:53,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:53:53,957][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:53:54,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[last message repeated 31 more times, through 2025-08-20 18:54:19,376]
[2025-08-20 18:54:21,004][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:54:21,926][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:54:21,928][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:54:23,242][__main__][INFO] - Iteration 692 took 55s (37.86% Gen, 62.14% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 34m 9s. Estimated total time: 15h 18m 1s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 0s.
[2025-08-20 18:54:23,243][__main__][INFO] - Starting iteration 692.
[2025-08-20 18:54:46,487][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:54:46,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:54:46,494][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:54:48,931][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:54:48,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:54:48,939][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 18:54:48,940][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:54:48,941][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:54:49,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[last message repeated 31 more times, through 2025-08-20 18:55:14,434]
[2025-08-20 18:55:16,047][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 18:55:16,963][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:55:16,964][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:55:18,292][__main__][INFO] - Iteration 693 took 55s (37.80% Gen, 62.20% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 32m 39s. Estimated total time: 15h 17m 26s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 44s, 500 more iterations: 7h 38m 43s.
[2025-08-20 18:55:18,293][__main__][INFO] - Starting iteration 693.
[2025-08-20 18:55:41,775][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:55:41,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:55:41,783][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:55:44,240][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:55:44,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:55:44,247][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:55:44,249][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:55:44,250][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:55:44,548 – 18:56:09,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32 times; individual timestamps elided)
[2025-08-20 18:56:11,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:56:12,429][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:56:12,431][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:56:13,706][__main__][INFO] - Iteration 694 took 55s (37.94% Gen, 62.06% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 37m 49s. Estimated total time: 15h 23m 32s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 46s.
[2025-08-20 18:56:13,707][__main__][INFO] - Starting iteration 694.
[2025-08-20 18:56:36,711][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:56:36,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:56:36,719][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:56:39,170][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:56:39,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:56:39,177][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:56:39,179][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:56:39,180][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:56:39,478 – 18:57:04,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32 times; individual timestamps elided)
[2025-08-20 18:57:06,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:57:07,102][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:57:07,103][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:57:08,386][__main__][INFO] - Iteration 695 took 54s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 24m 41s. Estimated total time: 15h 11m 18s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 7s, 500 more iterations: 7h 35m 39s.
[2025-08-20 18:57:08,388][__main__][INFO] - Starting iteration 695.
[2025-08-20 18:57:31,581][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:57:31,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:57:31,588][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:57:34,063][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:57:34,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:57:34,070][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:57:34,073][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:57:34,073][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:57:34,373 – 18:57:59,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32 times; individual timestamps elided)
[2025-08-20 18:58:01,207][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:58:02,155][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:58:02,157][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:58:03,479][__main__][INFO] - Iteration 696 took 55s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 30m 38s. Estimated total time: 15h 18m 10s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 5s.
[2025-08-20 18:58:03,481][__main__][INFO] - Starting iteration 696.
[2025-08-20 18:58:27,258][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:58:27,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:58:27,266][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:58:29,699][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:58:29,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:58:29,707][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:58:29,709][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:58:29,710][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:58:30,007 – 18:58:55,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32 times; individual timestamps elided)
[2025-08-20 18:58:56,689][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:58:57,675][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:58:57,676][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:58:59,000][__main__][INFO] - Iteration 697 took 55s (38.42% Gen, 61.57% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 36m 49s. Estimated total time: 15h 25m 17s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 38s.
[2025-08-20 18:58:59,002][__main__][INFO] - Starting iteration 697.
[2025-08-20 18:59:22,075][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:59:22,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:59:22,083][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:59:24,520][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:59:24,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:59:24,527][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 18:59:24,529][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 18:59:24,530][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 18:59:24,829 – 18:59:49,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32 times; individual timestamps elided)
[2025-08-20 18:59:51,618][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 18:59:52,574][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 18:59:52,576][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 18:59:53,929][__main__][INFO] - Iteration 698 took 54s (37.57% Gen, 62.43% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 26m 4s. Estimated total time: 15h 15m 27s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 43s.
[2025-08-20 18:59:53,931][__main__][INFO] - Starting iteration 698.
[2025-08-20 19:00:17,061][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:00:17,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:00:17,069][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:00:19,525][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:00:19,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:00:19,533][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:00:19,535][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:00:19,536][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:00:19,834 – 19:00:45,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32 times; individual timestamps elided)
[2025-08-20 19:00:46,684][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:00:47,659][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:00:47,660][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:00:48,958][__main__][INFO] - Iteration 699 took 55s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 26m 49s. Estimated total time: 15h 17m 7s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 42s, 500 more iterations: 7h 38m 33s.
[2025-08-20 19:00:48,960][__main__][INFO] - Starting iteration 699.
[2025-08-20 19:01:12,008][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:01:12,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:01:12,016][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:01:14,470][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:01:14,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:01:14,478][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:01:14,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:01:14,481][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:01:14,778 – 19:01:39,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32 times; individual timestamps elided)
[2025-08-20 19:01:41,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:01:42,485][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:01:42,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:01:43,836][__main__][INFO] - Iteration 700 took 54s (37.53% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 23m 23s. Estimated total time: 15h 14m 36s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 18s.
[2025-08-20 19:01:43,839][__main__][INFO] - Starting iteration 700.
[2025-08-20 19:02:07,020][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:02:07,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:02:07,028][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:02:09,495][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:02:09,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:02:09,503][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:02:09,505][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:02:09,506][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:02:09,806 – 19:02:35,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32 times; individual timestamps elided)
[2025-08-20 19:02:36,638][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:02:37,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:02:37,604][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:02:41,939][__main__][INFO] - Iteration 701 took 58s (35.64% Gen, 59.12% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5h 16m 9s. Estimated total time: 16h 8m 20s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 50s, 500 more iterations: 8h 4m 10s.
[2025-08-20 19:02:41,941][__main__][INFO] - Starting iteration 701.
[2025-08-20 19:03:05,290][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:03:05,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:03:05,298][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:03:07,729][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:03:07,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:03:07,737][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:03:07,739][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:03:07,739][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:03:08,037 – 19:03:33,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (message repeated 32 times; individual timestamps elided)
[2025-08-20 19:03:34,799][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:03:35,798][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:03:35,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:03:37,112][__main__][INFO] - Iteration 702 took 55s (37.92% Gen, 62.08% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 26m 24s. Estimated total time: 15h 19m 30s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 57s, 500 more iterations: 7h 39m 45s.
[2025-08-20 19:03:37,114][__main__][INFO] - Starting iteration 702.
[2025-08-20 19:04:00,406][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:00,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:00,414][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:02,857][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:02,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:02,865][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:02,867][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:04:02,867][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:04:03,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:03,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:04,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:05,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:06,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:07,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:07,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:08,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:09,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:10,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:11,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:11,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:12,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:13,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:14,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:15,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:16,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:17,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:17,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:18,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:19,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:20,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:21,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:21,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:22,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:23,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:24,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:25,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:25,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:26,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:27,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:28,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:29,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:04:30,863][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:04:30,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:04:32,132][__main__][INFO] - Iteration 703 took 55s (37.92% Gen, 62.07% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 22m 56s. Estimated total time: 15h 16m 57s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 28s.
[2025-08-20 19:04:32,134][__main__][INFO] - Starting iteration 703.
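
Each iteration emits 32 "Accumulated the policy gradient loss for 8960 tokens." records before a single "Apply reinforce step", i.e. one optimizer update accumulated over 32 fixed-size micro-batches (32 × 8960 = 286,720 tokens per update). A sketch of that pattern follows; the names (model, microbatches, the batch keys) and the HF-style .logits interface are illustrative, not taken from the trainer.

import torch

def reinforce_update(model, optimizer, microbatches, logger):
    """One REINFORCE step via gradient accumulation, matching the log's
    pattern of N 'Accumulated ...' records followed by one step (a sketch)."""
    optimizer.zero_grad(set_to_none=True)
    n = len(microbatches)
    for mb in microbatches:  # 32 micro-batches of 8960 tokens in this run
        logits = model(mb["input_ids"]).logits
        logp = torch.log_softmax(logits, dim=-1)
        taken = logp.gather(-1, mb["labels"].unsqueeze(-1)).squeeze(-1)
        # REINFORCE: advantage-weighted negative log-likelihood over action tokens
        loss = -(taken * mb["advantages"] * mb["mask"]).sum() / mb["mask"].sum()
        (loss / n).backward()  # scale so accumulation averages the micro-batches
        logger.info(
            f"Accumulated the policy gradient loss for {int(mb['mask'].numel())} tokens."
        )
    optimizer.step()  # the 'Apply reinforce step' record brackets this point

The fixed 8960-token count on every record suggests the micro-batches are padded to a constant shape, which also explains the perfectly steady ~0.8 s per accumulation step.
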
[2025-08-20 19:04:55,265][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:55,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:55,273][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:57,724][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:57,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:57,731][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:04:57,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:04:57,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:04:58,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:58,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:04:59,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:00,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:01,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:02,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:02,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:03,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:04,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:05,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:05,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:06,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:07,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:08,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:09,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:09,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:10,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:11,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:12,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:13,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:13,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:14,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:15,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:16,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:17,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:18,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:19,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:20,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:20,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:21,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:22,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:23,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:24,898][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:05:25,834][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:05:25,836][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:05:28,209][__main__][INFO] - Iteration 704 took 56s (36.89% Gen, 63.11% Train). Generation: 20s, Training: 35s. Estimated remaining time: 4h 39m 38s. Estimated total time: 15h 34m 35s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 17s.
[2025-08-20 19:05:28,211][__main__][INFO] - Starting iteration 704.
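
The two "Saved ... optimizer state" records at the end of every iteration persist the policy and critic optimizer states so Adam moments survive preemption. The equivalent is just a pair of torch.save/torch.load calls on the state dicts; the function names and signature below are assumed, not the trainer's API.

import os
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir):
    # Mirrors the 'Saved main/critic optimizer state' records above.
    torch.save(policy_opt.state_dict(), os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(), os.path.join(out_dir, "critic_optimizer_state.pt"))

def load_optimizer_states(policy_opt, critic_opt, out_dir):
    # On restart, restore both states before resuming training.
    policy_opt.load_state_dict(torch.load(os.path.join(out_dir, "policy_optimizer_state.pt")))
    critic_opt.load_state_dict(torch.load(os.path.join(out_dir, "critic_optimizer_state.pt")))
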
[2025-08-20 19:05:51,238][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:05:51,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:05:51,246][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:05:53,708][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:05:53,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:05:53,716][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:05:53,718][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:05:53,719][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:05:54,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:54,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:55,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:56,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:57,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:57,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:05:58,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:00,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:01,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:02,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:03,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:04,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:04,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:05,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:06,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:07,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:07,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:08,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:09,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:10,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:11,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:12,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:13,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:14,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:14,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:15,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:16,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:17,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:18,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:18,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:19,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:20,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:22,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:06:22,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:06:22,946][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:06:24,291][__main__][INFO] - Iteration 705 took 56s (36.68% Gen, 63.32% Train). Generation: 20s, Training: 35s. Estimated remaining time: 4h 38m 46s. Estimated total time: 15h 34m 39s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 19s.
[2025-08-20 19:06:24,292][__main__][INFO] - Starting iteration 705.
[2025-08-20 19:06:47,553][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:06:47,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:06:47,560][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:06:50,028][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:06:50,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:06:50,036][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:06:50,038][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:06:50,038][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:06:50,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:51,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:51,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:52,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:53,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:54,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:55,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:55,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:56,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:57,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:58,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:59,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:06:59,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:00,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:01,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:02,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:03,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:03,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:04,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:05,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:06,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:07,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:08,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:09,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:09,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:10,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:11,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:12,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:13,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:13,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:14,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:15,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:17,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:07:18,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:07:18,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:07:29,533][__main__][INFO] - Iteration 706 took 1m 5s (31.89% Gen, 68.11% Train). Generation: 20s, Training: 44s. Estimated remaining time: 7h 10m 21s. Estimated total time: 18h 7m 19s. Time estimates for 10 more iterations: 10m 52s, 100 more iterations: 1h 48m 43s, 500 more iterations: 9h 3m 39s.
[2025-08-20 19:07:29,534][__main__][INFO] - Starting iteration 706.
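
Iteration 706 is the outlier in this stretch: 1m 5s against the usual ~55s, with the extra ~10s falling between the critic-optimizer save (19:07:18) and the summary line (19:07:29), plausibly checkpoint I/O. The single slow iteration also inflates the projections: the "10/100/500 more iterations" figures match the latest iteration time multiplied out (65s × 10 ≈ 10m 52s, × 100 ≈ 1h 48m, × 500 ≈ 9h 3m). A sketch of that arithmetic, assuming linear extrapolation from the last iteration; the log's own "Estimated remaining time" evidently uses a different (smoothed or weighted) statistic, so this is a simplification.

def fmt(seconds: float) -> str:
    # Render seconds in the log's "Xh Ym Zs" style.
    s = round(seconds)
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    return f"{h}h {m}m {sec}s" if h else f"{m}m {sec}s"

def eta_report(last_iter_s: float, iters_done: int, iters_total: int) -> str:
    # Naive ETA: extrapolate the last iteration's duration linearly.
    remaining = iters_total - iters_done
    pieces = [f"Estimated remaining time: {fmt(last_iter_s * remaining)}."]
    pieces += [f"{n} more iterations: {fmt(last_iter_s * n)}" for n in (10, 100, 500)]
    return " ".join(pieces)

# e.g. eta_report(65.2, 706, 1000) reproduces the "10m 52s / 1h 48m / 9h 3m"
# projections above to within a second of rounding.
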
[2025-08-20 19:07:52,985][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:07:52,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:07:52,993][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:07:55,453][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:07:55,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:07:55,462][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:07:55,464][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:07:55,465][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:07:55,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:56,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:57,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:58,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:58,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:07:59,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:00,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:01,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:02,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:02,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:03,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:04,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:05,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:06,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:07,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:08,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:08,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:09,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:10,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:11,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:12,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:12,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:13,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:14,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:15,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:16,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:16,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:17,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:18,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:19,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:19,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:20,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:22,421][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:08:23,332][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:08:23,333][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:08:24,796][__main__][INFO] - Iteration 707 took 55s (37.97% Gen, 62.03% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 23m 7s. Estimated total time: 15h 21m 0s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 30s.
[2025-08-20 19:08:24,797][__main__][INFO] - Starting iteration 707.
[2025-08-20 19:08:48,997][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:08:48,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:08:49,004][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:08:51,466][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:08:51,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:08:51,474][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:08:51,476][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:08:51,477][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:08:51,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:52,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:53,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:54,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:54,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:55,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:56,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:57,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:58,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:58,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:08:59,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:00,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:01,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:02,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:02,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:03,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:04,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:05,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:06,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:06,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:07,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:08,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:09,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:10,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:11,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:12,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:12,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:13,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:14,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:15,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:16,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:16,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:18,405][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:09:19,314][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:09:19,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:09:20,785][__main__][INFO] - Iteration 708 took 55s (38.84% Gen, 61.16% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 34m 18s. Estimated total time: 15h 33m 7s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 33s.
[2025-08-20 19:09:20,787][__main__][INFO] - Starting iteration 708.
[2025-08-20 19:09:44,065][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:09:44,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:09:44,073][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:09:46,550][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:09:46,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:09:46,558][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:09:46,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:09:46,560][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:09:46,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:47,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:48,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:49,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:50,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:50,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:51,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:52,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:53,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:54,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:54,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:55,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:56,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:57,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:57,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:58,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:09:59,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:00,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:01,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:02,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:03,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:04,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:04,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:05,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:06,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:07,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:07,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:08,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:09,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:10,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:11,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:11,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:13,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:10:14,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:10:14,531][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:10:15,840][__main__][INFO] - Iteration 709 took 55s (37.81% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 17m 48s. Estimated total time: 15h 17m 33s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 46s.
[2025-08-20 19:10:15,842][__main__][INFO] - Starting iteration 709.
[2025-08-20 19:10:39,034][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:10:39,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:10:39,041][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:10:41,503][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:10:41,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:10:41,512][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:10:41,514][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:10:41,515][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:10:41,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:42,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:43,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:44,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:44,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:45,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:46,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:47,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:48,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:48,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:49,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:50,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:51,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:52,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:52,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:53,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:54,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:55,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:56,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:56,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:58,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:58,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:10:59,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:00,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:01,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:02,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:02,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:03,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:04,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:05,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:06,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:06,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:08,514][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:11:09,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:11:09,439][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:11:10,844][__main__][INFO] - Iteration 710 took 55s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 16m 1s. Estimated total time: 15h 16m 41s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 20s.
[2025-08-20 19:11:10,845][__main__][INFO] - Starting iteration 710.
[2025-08-20 19:11:34,641][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:11:34,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:11:34,649][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:11:37,070][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:11:37,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:11:37,078][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:11:37,080][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:11:37,081][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:11:37,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:38,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:38,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:39,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:40,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:41,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:42,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:42,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:43,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:44,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:45,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:46,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:46,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:47,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:48,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:49,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:50,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:50,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:51,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:52,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:53,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:54,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:55,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:56,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:56,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:57,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:58,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:11:59,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:12:00,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:12:00,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:12:01,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:12:02,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:12:04,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:12:04,995][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:12:04,996][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:12:06,317][__main__][INFO] - Iteration 711 took 55s (38.52% Gen, 61.48% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 22m 56s. Estimated total time: 15h 24m 31s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 15s.
[2025-08-20 19:12:06,319][__main__][INFO] - Starting iteration 711.
[2025-08-20 19:12:30,039][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:12:30,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:12:30,046][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:12:32,523][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:12:32,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:12:32,533][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:12:32,535][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:12:32,536][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:12:32,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted, 2025-08-20 19:12:33,627 through 19:12:57,182 ...]
[2025-08-20 19:12:57,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:12:59,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:13:00,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:13:00,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:13:01,762][__main__][INFO] - Iteration 712 took 55s (38.28% Gen, 61.72% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 21m 32s. Estimated total time: 15h 24m 2s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 24s, 500 more iterations: 7h 42m 1s.
[2025-08-20 19:13:01,764][__main__][INFO] - Starting iteration 712.
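The runs of "Accumulated the policy gradient loss for 8960 tokens." (32 per update, i.e. 32 × 8960 = 286,720 action tokens per reinforce step) are consistent with plain gradient accumulation over fixed-size micro-batches: each backward() adds into the same .grad buffers until the step is applied. A hedged sketch follows; the batch fields and the token_logprobs helper are assumptions, not the codebase's API.

import torch

def accumulate_policy_gradient(policy, micro_batches, logger):
    """Accumulate the REINFORCE surrogate loss over micro-batches."""
    for batch in micro_batches:  # 32 micro-batches of 8960 tokens in this run
        # Per-token log-probs of the actions actually taken (hypothetical helper).
        logprobs = policy.token_logprobs(batch["input_ids"])
        # REINFORCE surrogate: advantage-weighted log-likelihood over action tokens.
        masked = batch["advantages"] * logprobs * batch["mask"]
        loss = -masked.sum() / batch["mask"].sum()
        loss.backward()  # gradients accumulate across calls until optimizer.step()
        logger.info(
            f"Accumulated the policy gradient loss for {int(batch['mask'].sum())} tokens."
        )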
[2025-08-20 19:13:25,199][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:13:25,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:13:25,206][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:13:27,653][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:13:27,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:13:27,661][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:13:27,663][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:13:27,664][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:13:27,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted, 2025-08-20 19:13:28,754 through 19:13:52,310 ...]
[2025-08-20 19:13:53,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:13:54,669][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:13:55,587][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:13:55,589][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:13:57,222][__main__][INFO] - Iteration 713 took 55s (37.84% Gen, 62.15% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 20m 51s. Estimated total time: 15h 24m 17s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 25s, 500 more iterations: 7h 42m 8s.
[2025-08-20 19:13:57,223][__main__][INFO] - Starting iteration 713.
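"Apply reinforce step" followed by the two "Saved … optimizer state" lines matches the usual pattern: one optimizer step over the accumulated gradients, then torch.save of the optimizer state_dicts to the *.pt paths shown above. A sketch under those assumptions; the gradient-clipping value is illustrative, and only the file names mirror the log.

import torch

def apply_reinforce_step(policy, policy_opt, critic_opt, save_dir):
    # One update over the gradients accumulated across all micro-batches.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    policy_opt.step()
    policy_opt.zero_grad(set_to_none=True)
    # Persist optimizer states so a preempted run can resume mid-training.
    torch.save(policy_opt.state_dict(), f"{save_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{save_dir}/critic_optimizer_state.pt")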
[2025-08-20 19:14:20,425][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:14:20,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:14:20,433][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:14:22,888][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:14:22,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:14:22,896][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:14:22,898][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:14:22,898][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:14:23,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted, 2025-08-20 19:14:23,985 through 19:14:47,504 ...]
[2025-08-20 19:14:48,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:14:49,895][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:14:50,878][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:14:50,880][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:14:52,442][__main__][INFO] - Iteration 714 took 55s (37.59% Gen, 62.41% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 15m 57s. Estimated total time: 15h 20m 18s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 1s, 500 more iterations: 7h 40m 9s.
[2025-08-20 19:14:52,444][__main__][INFO] - Starting iteration 714.
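Each iteration summary can be derived from three per-iteration quantities: generation seconds, training seconds, and a running mean of iteration durations. The remaining-time figures (≈4h 16m at ≈55 s/iteration after iteration 714) imply a horizon of roughly 1,000 iterations; that total is inferred from the ETAs, not stated in the log. A sketch of the arithmetic, with all names illustrative:

def fmt_hms(seconds):
    h, rest = divmod(int(seconds), 3600)
    m, s = divmod(rest, 60)
    return f"{h}h {m}m {s}s"

def iteration_summary(i, gen_s, train_s, mean_iter_s, total_iters=1000):
    took = gen_s + train_s
    return (
        f"Iteration {i} took {took:.0f}s "
        f"({100 * gen_s / took:.2f}% Gen, {100 * train_s / took:.2f}% Train). "
        f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s. "
        f"Estimated remaining time: {fmt_hms((total_iters - i) * mean_iter_s)}. "
        f"Estimated total time: {fmt_hms(total_iters * mean_iter_s)}. "
        f"Time estimates for 10 more iterations: {fmt_hms(10 * mean_iter_s)}, "
        f"100 more iterations: {fmt_hms(100 * mean_iter_s)}, "
        f"500 more iterations: {fmt_hms(500 * mean_iter_s)}."
    )

The slight drift in the totals from one iteration to the next (15h 24m → 15h 20m → 15h 30m) is what a running mean over recent iteration times would produce.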
[2025-08-20 19:15:15,932][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:15:15,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:15:15,940][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:15:18,405][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:15:18,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:15:18,413][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:15:18,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:15:18,416][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:15:18,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted, 2025-08-20 19:15:19,507 through 19:15:43,687 ...]
[2025-08-20 19:15:44,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:15:46,076][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:15:46,957][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:15:46,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:15:48,273][__main__][INFO] - Iteration 715 took 55s (37.69% Gen, 62.31% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 25m 11s. Estimated total time: 15h 30m 28s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 2s, 500 more iterations: 7h 45m 14s.
[2025-08-20 19:15:48,274][__main__][INFO] - Starting iteration 715.
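"Sharing advantage alignment data." / "Receiving advantage packets." suggests the two agents' trainers swap advantage estimates before weighting their policy gradients, since advantage alignment couples each agent's update to its opponent's advantages. Both the queue transport and the alignment expression below are illustrative assumptions, not the repository's implementation:

import queue

import torch

def exchange_advantages(own_adv, outbox: queue.Queue, inbox: queue.Queue):
    outbox.put(own_adv)    # "Sharing advantage alignment data."
    return inbox.get()     # "Receiving advantage packets."

def aligned_weights(own_adv, opp_adv, beta=1.0):
    # One form used in advantage-alignment methods: couple the opponent's
    # current advantage with the running sum of one's own past advantages.
    # This exact expression is an assumption, not lifted from the log.
    past_own = torch.cumsum(own_adv, dim=-1) - own_adv  # strictly-past sum
    return own_adv + beta * past_own * opp_adv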
[2025-08-20 19:16:11,399][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:16:11,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:16:11,407][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:16:13,865][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:16:13,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:16:13,873][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:16:13,875][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:16:13,876][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:16:14,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted, 2025-08-20 19:16:14,964 through 19:16:38,392 ...]
[2025-08-20 19:16:39,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:16:40,780][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:16:41,727][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:16:41,728][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:16:43,029][__main__][INFO] - Iteration 716 took 54s (37.76% Gen, 62.24% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 6m 21s. Estimated total time: 15h 12m 33s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 16s.
[2025-08-20 19:16:43,030][__main__][INFO] - Starting iteration 716.
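"Create alternative trajectory batch" and "Compute alternative advantage estimates" point to counterfactual scoring: re-evaluating the logged trajectories with alternative actions spliced in so the critic can estimate advantages for actions not taken. A speculative sketch; the field names and the critic API are placeholders:

import torch

def make_alternative_batch(batch, alt_actions):
    # Copy the logged trajectories but swap in counterfactual actions.
    alt = {k: v.clone() for k, v in batch.items()}  # assumes tensor-valued fields
    alt["actions"] = alt_actions
    return alt

def alternative_advantages(critic, batch, alt_batch):
    # A(s, a') ≈ Q(s, a') - V(s), both estimated by the same critic (placeholder API).
    with torch.no_grad():
        return critic.q_values(alt_batch) - critic.values(batch)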
[2025-08-20 19:17:06,329][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:17:06,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:17:06,337][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:17:08,825][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:17:08,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:17:08,833][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:17:08,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:17:08,836][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:17:09,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted, 2025-08-20 19:17:09,928 through 19:17:33,443 ...]
[2025-08-20 19:17:34,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:17:35,820][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:17:36,740][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:17:36,742][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:17:38,100][__main__][INFO] - Iteration 717 took 55s (37.81% Gen, 62.19% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 10m 42s. Estimated total time: 15h 17m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 54s.
[2025-08-20 19:17:38,101][__main__][INFO] - Starting iteration 717.
[2025-08-20 19:18:01,575][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:01,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:01,582][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:04,051][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:04,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:04,059][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:04,061][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:18:04,062][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:18:04,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted, 2025-08-20 19:18:05,151 through 19:18:28,669 ...]
[2025-08-20 19:18:29,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:18:31,096][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:18:32,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:18:32,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:18:33,328][__main__][INFO] - Iteration 718 took 55s (38.08% Gen, 61.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 12m 24s. Estimated total time: 15h 20m 26s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 13s.
[2025-08-20 19:18:33,330][__main__][INFO] - Starting iteration 718.
[2025-08-20 19:18:56,633][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:56,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:56,641][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:59,111][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:59,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:59,119][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:18:59,122][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:18:59,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:18:59,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted, 2025-08-20 19:19:00,211 through 19:19:23,639 ...]
[2025-08-20 19:19:24,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:19:26,041][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:19:26,940][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:19:26,941][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:19:28,269][__main__][INFO] - Iteration 719 took 54s (37.95% Gen, 62.04% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 6m 41s. Estimated total time: 15h 15m 38s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 49s.
[2025-08-20 19:19:28,270][__main__][INFO] - Starting iteration 719.
[2025-08-20 19:19:51,531][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:19:51,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:19:51,539][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:19:53,979][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:19:53,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:19:53,987][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:19:53,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:19:53,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:19:54,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical entries omitted, 2025-08-20 19:19:55,080 through 19:20:18,579 ...]
[2025-08-20 19:20:19,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:20,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:20:21,957][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:20:21,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:20:23,309][__main__][INFO] - Iteration 720 took 55s (37.85% Gen, 62.15% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 7m 26s. Estimated total time: 15h 17m 18s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 39s.
[2025-08-20 19:20:23,311][__main__][INFO] - Starting iteration 720.
[2025-08-20 19:20:46,525][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:20:46,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:20:46,532][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:20:49,000][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:20:49,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:20:49,008][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:20:49,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:20:49,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:20:49,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:50,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:50,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:51,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:52,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:53,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:54,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:54,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:55,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:56,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:57,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:58,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:58,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:20:59,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:00,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:01,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:02,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:02,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:03,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:04,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:05,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:05,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:07,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:08,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:08,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:09,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:10,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:11,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:12,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:12,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:13,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:14,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:21:16,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:21:16,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:21:16,995][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:21:18,297][__main__][INFO] - Iteration 721 took 54s (37.72% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 5m 39s. Estimated total time: 15h 16m 26s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 13s.
[2025-08-20 19:21:18,299][__main__][INFO] - Starting iteration 721.
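The "For task: ... ΔVRAM ... ΔTime ..." entries suggest each pipeline stage runs inside a small profiling wrapper that diffs CUDA memory counters and wall time around the stage. A sketch of such a context manager, assuming the standard torch.cuda counters (the class name and the exact definition of "Percentage of VRAM taken" are assumptions, not the mllm implementation):

```python
import datetime
import logging
import time

import torch

log = logging.getLogger("mllm.training.trainer_ad_align")

class TaskProfiler:
    """Logs the change in allocated/reserved CUDA memory and wall time
    across a task, plus overall VRAM utilisation afterwards."""

    def __init__(self, task: str):
        self.task = task

    def __enter__(self):
        self.t0 = time.monotonic()
        self.alloc0 = torch.cuda.memory_allocated()
        self.res0 = torch.cuda.memory_reserved()
        return self

    def __exit__(self, *exc):
        gib = 1024 ** 3
        d_alloc = (torch.cuda.memory_allocated() - self.alloc0) / gib
        d_res = (torch.cuda.memory_reserved() - self.res0) / gib
        dt = datetime.timedelta(seconds=round(time.monotonic() - self.t0))
        total = torch.cuda.get_device_properties(0).total_memory
        pct = 100 * torch.cuda.memory_reserved() / total  # assumed definition
        log.info(f"For task: {self.task} , ΔVRAM Allocated: {d_alloc} GB, "
                 f"ΔVRAM Reserved: {d_res} GB, ΔTime: {dt}, "
                 f"Percentage of VRAM taken: {pct}%,")

# Usage sketch:
# with TaskProfiler("Create alternative trajectory batch"):
#     batch = make_alternative_batch(...)  # hypothetical stage
```

A ΔVRAM of 0.0 GB with steady ~69.9% utilisation, as logged above, indicates the allocator reached a fixed working set early and these stages reuse cached memory.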
[2025-08-20 19:21:42,193][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:21:42,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:21:42,201][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:21:44,637][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:21:44,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:21:44,645][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:21:44,647][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:21:44,648][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:21:44,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:45,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:46,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:47,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:48,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:48,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:49,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:50,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:51,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:52,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:52,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:53,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:54,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:55,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:56,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:56,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:57,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:58,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:21:59,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:00,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:00,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:01,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:02,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:03,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:04,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:05,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:06,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:06,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:07,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:08,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:09,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:09,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8961 tokens.
[2025-08-20 19:22:11,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:22:12,539][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:22:12,541][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:22:13,970][__main__][INFO] - Iteration 722 took 55s (38.52% Gen, 61.47% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 16m 7s. Estimated total time: 15h 27m 50s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 55s.
[2025-08-20 19:22:13,971][__main__][INFO] - Starting iteration 722.
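"Get advantages with critic gradient accumulation" implies a learned critic supplies the advantage estimates that weight the policy-gradient loss, with the critic's own loss accumulated in the same pass. The exact estimator is not visible in this log; as a stand-in, a generalized-advantage-estimation (GAE) sketch over a single trajectory:

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """GAE over one trajectory; rewards and values are 1-D of length T.
    The value after the final step is taken to be zero (episode end)."""
    T = rewards.shape[0]
    adv = torch.zeros_like(rewards)
    last = torch.zeros(())
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else torch.zeros(())
        delta = rewards[t] + gamma * next_v - values[t]   # TD residual
        last = delta + gamma * lam * last                 # discounted sum
        adv[t] = last
    return adv
```

The paired "Create alternative trajectory batch" / "Compute alternative advantage estimates" tasks suggest the same machinery is also run on counterfactual trajectories, consistent with the advantage-alignment objective this trainer is named after.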
[2025-08-20 19:22:37,536][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:22:37,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:22:37,543][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:22:40,002][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:22:40,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:22:40,010][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:22:40,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:22:40,012][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:22:40,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:41,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:41,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:42,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:43,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:44,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:45,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:45,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:46,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:47,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:48,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:49,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:49,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:50,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:51,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:52,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:53,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:53,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:55,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:55,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:56,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:57,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:58,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:59,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:22:59,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:00,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:01,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:02,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:03,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:04,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:04,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:05,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:07,174][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:23:08,114][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:23:08,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:23:09,394][__main__][INFO] - Iteration 723 took 55s (38.08% Gen, 61.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 11m 4s. Estimated total time: 15h 23m 42s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 51s.
[2025-08-20 19:23:09,395][__main__][INFO] - Starting iteration 723.
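"Sharing advantage alignment data" / "Receiving advantage packets" indicates the two agent trainers swap advantage estimates every iteration, which advantage alignment requires in order to weight one agent's policy gradient by its opponent's advantages. The transport is not shown in the log; a hypothetical sketch using torch.distributed point-to-point ops (assumes an initialized process group with one rank per trainer):

```python
import torch
import torch.distributed as dist

def exchange_advantages(own_adv: torch.Tensor, peer_rank: int) -> torch.Tensor:
    """Symmetric advantage exchange between two trainer ranks."""
    peer_adv = torch.empty_like(own_adv)
    # Order the blocking send/recv by rank so the two sides never
    # both block on send at once (which would deadlock).
    if dist.get_rank() < peer_rank:
        dist.send(own_adv, dst=peer_rank)    # "Sharing advantage alignment data."
        dist.recv(peer_adv, src=peer_rank)   # "Receiving advantage packets."
    else:
        dist.recv(peer_adv, src=peer_rank)
        dist.send(own_adv, dst=peer_rank)
    return peer_adv
```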
[2025-08-20 19:23:32,631][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:23:32,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:23:32,639][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:23:35,109][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:23:35,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:23:35,116][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:23:35,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:23:35,119][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:23:35,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:36,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:37,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:37,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:38,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:39,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:40,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:40,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:41,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:42,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:43,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:44,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:44,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:45,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:46,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:47,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:48,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:49,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:50,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:51,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:51,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:52,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:53,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:54,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:54,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:55,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:56,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:57,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:58,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:58,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:23:59,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:00,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:02,237][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:24:03,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:24:03,187][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:24:04,499][__main__][INFO] - Iteration 724 took 55s (37.71% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 4m 49s. Estimated total time: 15h 18m 23s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 50s, 500 more iterations: 7h 39m 11s.
[2025-08-20 19:24:04,502][__main__][INFO] - Starting iteration 724.
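After every reinforce step the trainer writes policy_optimizer_state.pt and critic_optimizer_state.pt, so a resumed run keeps the optimizer's moment estimates instead of restarting them from zero. The standard PyTorch pattern for this, sketched with the paths from the log (the helper name is an assumption):

```python
import os

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir: str):
    """Persist both optimizer state dicts after each reinforce step."""
    os.makedirs(out_dir, exist_ok=True)
    torch.save(policy_opt.state_dict(),
               os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(),
               os.path.join(out_dir, "critic_optimizer_state.pt"))

# On resume:
# policy_opt.load_state_dict(
#     torch.load(os.path.join(out_dir, "policy_optimizer_state.pt")))
```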
[2025-08-20 19:24:27,760][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:24:27,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:24:27,767][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:24:30,210][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:24:30,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:24:30,218][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:24:30,220][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:24:30,221][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:24:30,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:31,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:32,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:32,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:33,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:34,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:35,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:36,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:36,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:37,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:38,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:39,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:40,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:40,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:41,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:42,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:43,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:43,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:44,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:45,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:46,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:47,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:48,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:49,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:50,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:50,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:51,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:52,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:53,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:54,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:54,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:55,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:24:57,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:24:58,208][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:24:58,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:24:59,497][__main__][INFO] - Iteration 725 took 54s (37.84% Gen, 62.15% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 2m 3s. Estimated total time: 15h 16m 31s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 39s, 500 more iterations: 7h 38m 15s.
[2025-08-20 19:24:59,498][__main__][INFO] - Starting iteration 725.
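The per-iteration summary extrapolates remaining and total time from the durations of completed iterations, and the 10/100/500-iteration figures are the same per-iteration average projected over fixed horizons. A sketch of that bookkeeping with a plain running mean (whether mllm averages the full history or a recent window is not visible in the log):

```python
import datetime

def eta_report(durations, iters_done, iters_total):
    """Project remaining/total time from completed-iteration durations.

    durations: list of datetime.timedelta, one per finished iteration.
    """
    avg = sum(durations, datetime.timedelta()) / len(durations)
    remaining = avg * (iters_total - iters_done)
    total = avg * iters_total
    horizons = {n: avg * n for n in (10, 100, 500)}  # e.g. 10 -> ~9m 10s above
    return remaining, total, horizons
```

With ~55 s per iteration, 10 iterations project to roughly 9m 10s, matching the logged figures; the small drift in the total-time estimate from one iteration to the next reflects the second-level rounding of each measured duration.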
[2025-08-20 19:25:23,173][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:25:23,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:25:23,180][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:25:25,646][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:25:25,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:25:25,653][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:25:25,656][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:25:25,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:25:25,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:26,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:27,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:28,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:29,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:29,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:30,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:31,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:32,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:33,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:33,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:34,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:35,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:36,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:37,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:38,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:39,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:39,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:40,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:41,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:42,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:43,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:43,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:44,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:45,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:46,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:47,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:47,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:48,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:49,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:50,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:51,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:25:52,568][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:25:53,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:25:53,537][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:25:54,886][__main__][INFO] - Iteration 726 took 55s (38.30% Gen, 61.70% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 7m 43s. Estimated total time: 15h 23m 7s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 33s.
[2025-08-20 19:25:54,887][__main__][INFO] - Starting iteration 726.
[2025-08-20 19:26:18,299][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:26:18,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:26:18,307][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:26:20,783][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:26:20,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:26:20,791][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:26:20,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:26:20,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:26:21,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:21,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:22,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:23,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:24,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:25,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:25,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:26,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:27,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:28,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:29,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:29,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:30,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:31,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:32,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:32,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:33,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:34,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:35,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:36,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:37,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:38,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:39,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:39,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:40,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:41,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:42,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:43,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:43,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:44,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:45,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:46,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:26:47,800][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:26:48,768][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:26:48,769][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:26:50,042][__main__][INFO] - Iteration 727 took 55s (37.99% Gen, 62.01% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 2m 55s. Estimated total time: 15h 19m 14s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 37s.
[2025-08-20 19:26:50,043][__main__][INFO] - Starting iteration 727.
[2025-08-20 19:27:13,998][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:27:13,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:27:14,005][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:27:16,435][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:27:16,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:27:16,442][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:27:16,444][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:27:16,445][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:27:16,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:17,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:18,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:19,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:19,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:20,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:21,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:22,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:23,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:23,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:24,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:25,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:26,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:27,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:27,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:28,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:29,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:30,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:31,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:31,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:32,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:33,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:34,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:35,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:36,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:37,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:37,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:38,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:39,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:40,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:41,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:41,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:27:43,373][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:27:44,368][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:27:44,370][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:27:45,696][__main__][INFO] - Iteration 728 took 55s (38.66% Gen, 61.34% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 10m 17s. Estimated total time: 15h 27m 32s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 45s, 500 more iterations: 7h 43m 46s.
[2025-08-20 19:27:45,697][__main__][INFO] - Starting iteration 728.
[2025-08-20 19:28:08,948][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:28:08,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:28:08,956][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:28:11,413][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:28:11,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:28:11,421][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:28:11,423][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:28:11,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:28:11,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:12,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:13,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:14,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:14,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:15,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:16,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:17,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:18,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:18,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:19,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:20,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:21,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:22,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:22,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:23,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:24,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:25,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:26,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:26,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:28,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:28,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:29,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:30,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:31,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:32,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:32,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:33,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:34,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:35,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:36,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:36,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:28:38,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:28:39,306][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:28:39,308][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:28:40,609][__main__][INFO] - Iteration 729 took 54s (37.87% Gen, 62.13% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 57m 1s. Estimated total time: 15h 15m 11s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 31s, 500 more iterations: 7h 37m 35s.
[2025-08-20 19:28:40,611][__main__][INFO] - Starting iteration 729.
[2025-08-20 19:29:03,908][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:29:03,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:29:03,917][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:29:06,383][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:29:06,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:29:06,390][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:29:06,393][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:29:06,393][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:29:06,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:07,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:08,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:09,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:09,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:10,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:11,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:12,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:13,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:13,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:14,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:15,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:16,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:17,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:17,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:18,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:29:19,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:20,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:20,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:21,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:23,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:23,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:24,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:25,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:26,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:27,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:27,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:28,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:29,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:30,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:31,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:31,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:29:33,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:29:34,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:29:34,465][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:29:35,732][__main__][INFO] - Iteration 730 took 55s (37.80% Gen, 62.20% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 59m 36s. Estimated total time: 15h 18m 40s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 20s. [2025-08-20 19:29:35,734][__main__][INFO] - Starting iteration 730. 
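The run of 32 "Accumulated the policy gradient loss for 8960 tokens" messages followed by a single "Apply reinforce step" is the standard gradient-accumulation pattern: backpropagate a per-chunk REINFORCE surrogate once per fixed-size token chunk, then take one optimizer step over the summed gradients. A schematic sketch under those assumptions; the chunk fields, the loss normalization, and the gradient clipping are illustrative choices, not the trainer's exact code:

```python
import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def reinforce_step(policy, optimizer, chunks, max_grad_norm=1.0):
    """One REINFORCE update accumulated over fixed-size token chunks.

    Each chunk is assumed to hold `input_ids`, a boolean `action_mask`
    marking generated tokens, and per-token `advantages`.
    """
    optimizer.zero_grad(set_to_none=True)
    for chunk in chunks:  # e.g. 32 chunks of 8960 tokens each
        logits = policy(chunk["input_ids"]).logits
        logp = torch.log_softmax(logits[:, :-1], dim=-1)
        token_logp = logp.gather(-1, chunk["input_ids"][:, 1:].unsqueeze(-1)).squeeze(-1)
        mask = chunk["action_mask"][:, 1:]
        # REINFORCE surrogate: -E[advantage * log pi(token)], over generated tokens only
        loss = -(chunk["advantages"][:, 1:] * token_logp * mask).sum() / mask.sum()
        (loss / len(chunks)).backward()  # scale so the accumulated gradient is a mean
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    chunk["input_ids"].numel())
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
```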
[2025-08-20 19:30:00,197][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:00,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:00,205][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:02,665][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:02,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:02,673][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:02,675][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:30:02,675][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:30:02,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:03,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:04,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:05,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:06,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:06,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:07,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:08,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:09,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:10,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:10,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:11,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:12,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:13,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:14,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:14,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:30:15,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:16,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:17,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:18,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:19,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:20,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:20,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:21,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:22,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:23,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:24,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:24,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:25,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:26,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:27,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:27,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:29,607][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:30:30,593][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:30:30,595][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:30:32,161][__main__][INFO] - Iteration 731 took 56s (38.99% Gen, 61.01% Train). Generation: 22s, Training: 34s. Estimated remaining time: 4h 20m 25s. Estimated total time: 15h 40m 27s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 2s, 500 more iterations: 7h 50m 13s. [2025-08-20 19:30:32,163][__main__][INFO] - Starting iteration 731. 
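"Sharing advantage alignment data" and "Receiving advantage packets" indicate that the two co-trained agents exchange advantage estimates so that each policy gradient can be weighted by both players' advantages, which is the core of advantage alignment. The sketch below is deliberately schematic: the queue-based transport and the simplified alignment weight (discounted sum of the agent's own past advantages times the opponent's current advantage) are assumptions, not the repository's protocol or exact objective:

```python
import torch

def exchange_advantages(own_adv, send_queue, recv_queue):
    """Share this agent's per-step advantages and receive the opponent's.

    `send_queue`/`recv_queue` stand in for whatever transport the two
    trainers actually use (e.g. multiprocessing queues); hypothetical API.
    """
    send_queue.put(own_adv.detach().cpu())
    return recv_queue.get().to(own_adv.device)

def alignment_weights(own_adv, opp_adv, gamma=0.96):
    """Simplified advantage-alignment weight: discounted sum of this agent's
    past advantages times the opponent's current advantage, so actions that
    were good for me and are good for you get reinforced."""
    weights = torch.zeros_like(own_adv)
    acc = torch.zeros_like(own_adv[..., 0])
    for t in range(own_adv.shape[-1]):
        acc = gamma * acc + own_adv[..., t]
        weights[..., t] = acc
    return weights * opp_adv
```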
[2025-08-20 19:30:55,598][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:55,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:55,606][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:58,069][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:58,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:58,077][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:30:58,078][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:30:58,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:30:58,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:59,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:30:59,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:00,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:01,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:02,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:03,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:03,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:04,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:05,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:06,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:07,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:07,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:08,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:09,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:10,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:31:11,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:11,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:12,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:13,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:14,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:15,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:16,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:17,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:17,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:18,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:19,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:20,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:21,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:21,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:22,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:23,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:25,196][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:31:26,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:31:26,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:31:27,484][__main__][INFO] - Iteration 732 took 55s (37.93% Gen, 62.07% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 1m 4s. Estimated total time: 15h 22m 0s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 12s, 500 more iterations: 7h 41m 0s. [2025-08-20 19:31:27,485][__main__][INFO] - Starting iteration 732. 
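At the end of every iteration the trainer persists both optimizer states so that a resumed run keeps its optimizer moments rather than restarting them from zero. A minimal sketch using standard PyTorch serialization; `policy_opt` and `critic_opt` are illustrative names, while the filenames mirror the ones in the log:

```python
import os

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    torch.save(policy_opt.state_dict(),
               os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(),
               os.path.join(out_dir, "critic_optimizer_state.pt"))

def load_optimizer_states(policy_opt, critic_opt, out_dir):
    policy_opt.load_state_dict(torch.load(
        os.path.join(out_dir, "policy_optimizer_state.pt"), map_location="cpu"))
    critic_opt.load_state_dict(torch.load(
        os.path.join(out_dir, "critic_optimizer_state.pt"), map_location="cpu"))
```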
[2025-08-20 19:31:51,410][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:31:51,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:31:51,418][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:31:53,900][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:31:53,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:31:53,908][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:31:53,910][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:31:53,911][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:31:54,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:54,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:55,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:56,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:57,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:58,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:58,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:31:59,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:00,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:01,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:02,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:02,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:03,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:04,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:05,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:06,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:32:06,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:07,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:08,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:09,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:10,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:11,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:12,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:12,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:13,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:14,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:15,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:16,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:16,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:17,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:18,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:19,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:20,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:32:21,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:32:21,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:32:23,213][__main__][INFO] - Iteration 733 took 55s (38.52% Gen, 61.48% Train). Generation: 21s, Training: 34s. Estimated remaining time: 4h 6m 55s. Estimated total time: 15h 28m 47s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 23s. [2025-08-20 19:32:23,214][__main__][INFO] - Starting iteration 733. 
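The iteration summaries extrapolate linearly from the mean iteration duration: at roughly 55 s per iteration, 10 more iterations come to about 9 m 10 s and 500 more to about 7 h 38 m, matching the figures logged above. A small sketch of that bookkeeping; the function and field names are illustrative:

```python
def eta_report(durations, iterations_left):
    """Extrapolate remaining time from the running mean iteration duration."""
    mean = sum(durations) / len(durations)  # seconds per iteration

    def fmt(seconds):
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

    parts = [f"Estimated remaining time: {fmt(mean * iterations_left)}."]
    for n in (10, 100, 500):
        parts.append(f"{n} more iterations: {fmt(mean * n)}")
    return " ".join(parts)

# eta_report([54.8, 55.2, 54.9], 260)
# -> "Estimated remaining time: 3h 58m ... 500 more iterations: 7h 38m ..."
```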
[2025-08-20 19:32:46,820][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:32:46,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:32:46,828][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:32:49,308][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:32:49,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:32:49,316][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:32:49,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:32:49,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:32:49,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:50,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:51,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:51,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:52,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:53,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:54,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:55,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:55,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:56,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:57,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:58,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:59,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:32:59,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:00,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:01,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:33:02,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:03,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:03,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:04,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:05,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:06,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:07,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:08,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:09,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:09,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:10,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:11,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:12,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:13,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:13,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:14,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:16,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:33:17,235][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:33:17,236][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:33:18,547][__main__][INFO] - Iteration 734 took 55s (38.20% Gen, 61.80% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 59m 24s. Estimated total time: 15h 22m 12s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 6s. [2025-08-20 19:33:18,548][__main__][INFO] - Starting iteration 734. 
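"Get advantages with critic gradient accumulation" folds two jobs into one pass over the rollout data: estimating per-step advantages from the critic's values and accumulating the critic's value-loss gradient across batches. A sketch of one standard way to do this, using generalized advantage estimation; `gamma`, `lam`, and the batch fields are assumptions rather than the run's actual settings:

```python
import torch

def advantages_with_critic_accumulation(critic, batches, gamma=0.96, lam=0.95):
    """Compute GAE advantages while accumulating the critic's value-loss gradient."""
    all_adv = []
    for b in batches:
        values = critic(b["obs"]).squeeze(-1)  # V(s_t), shape (B, T)
        with torch.no_grad():
            next_values = torch.cat(
                [values[:, 1:], torch.zeros_like(values[:, :1])], dim=1)
            deltas = b["rewards"] + gamma * next_values * b["not_done"] - values
            adv = torch.zeros_like(deltas)
            running = torch.zeros_like(deltas[:, 0])
            for t in reversed(range(deltas.shape[1])):
                running = deltas[:, t] + gamma * lam * b["not_done"][:, t] * running
                adv[:, t] = running
        returns = adv + values.detach()
        value_loss = torch.nn.functional.mse_loss(values, returns)
        (value_loss / len(batches)).backward()  # accumulate the critic gradient
        all_adv.append(adv)
    return torch.cat(all_adv)
```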
[2025-08-20 19:33:41,738][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:33:41,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:33:41,746][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:33:44,187][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:33:44,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:33:44,194][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:33:44,196][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:33:44,197][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:33:44,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:45,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:46,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:46,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:47,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:48,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:49,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:50,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:50,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:51,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:52,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:53,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:54,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:54,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:55,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:56,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:33:57,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:57,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:58,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:33:59,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:00,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:01,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:01,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:02,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:04,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:04,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:05,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:06,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:07,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:08,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:08,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:09,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:11,371][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:34:12,311][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:34:12,313][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:34:13,582][__main__][INFO] - Iteration 735 took 55s (37.73% Gen, 62.26% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 53m 30s. Estimated total time: 15h 17m 13s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 43s, 500 more iterations: 7h 38m 36s. [2025-08-20 19:34:13,583][__main__][INFO] - Starting iteration 735. 
[2025-08-20 19:34:36,894][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:34:36,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:34:36,901][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:34:39,332][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:34:39,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:34:39,339][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:34:39,342][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:34:39,342][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:34:39,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:40,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:41,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:42,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:42,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:43,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:44,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:45,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:45,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:46,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:47,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:48,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:49,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:49,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:50,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:51,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:34:52,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:53,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:53,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:54,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:55,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:56,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:57,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:58,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:59,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:34:59,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:00,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:01,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:02,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:03,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:03,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:04,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:06,408][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:35:07,334][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:35:07,336][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:35:08,765][__main__][INFO] - Iteration 736 took 55s (37.87% Gen, 62.13% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 55m 3s. Estimated total time: 15h 19m 40s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 50s. [2025-08-20 19:35:08,766][__main__][INFO] - Starting iteration 736. 
[2025-08-20 19:35:31,922][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:35:31,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:35:31,929][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:35:34,407][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:35:34,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:35:34,415][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:35:34,417][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:35:34,418][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:35:34,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:35,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:36,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:37,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:37,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:38,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:39,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:40,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:41,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:41,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:42,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:43,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:44,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:45,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:45,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:46,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:35:47,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:48,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:48,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:49,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:50,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:51,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:52,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:53,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:54,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:54,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:55,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:56,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:57,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:58,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:58,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:35:59,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:01,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:36:02,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:36:02,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:36:03,707][__main__][INFO] - Iteration 737 took 54s (37.68% Gen, 62.32% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 50m 7s. Estimated total time: 15h 15m 40s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 34s, 500 more iterations: 7h 37m 50s. [2025-08-20 19:36:03,709][__main__][INFO] - Starting iteration 737. 
[2025-08-20 19:36:27,258][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:36:27,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:36:27,265][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:36:29,722][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:36:29,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:36:29,729][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:36:29,731][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:36:29,732][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:36:30,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:30,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:31,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:32,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:33,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:33,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:34,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:35,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:36,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:37,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:37,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:38,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:39,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:40,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:41,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:41,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:36:42,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:43,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:44,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:45,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:45,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:46,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:47,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:48,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:49,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:50,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:51,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:52,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:52,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:53,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:54,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:55,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:36:56,798][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:36:57,703][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:36:57,704][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:36:59,101][__main__][INFO] - Iteration 738 took 55s (38.08% Gen, 61.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 56m 44s. Estimated total time: 15h 23m 12s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 36s. [2025-08-20 19:36:59,103][__main__][INFO] - Starting iteration 738. 
[2025-08-20 19:37:22,284][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:37:22,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:37:22,291][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:37:24,745][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:37:24,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:37:24,753][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:37:24,755][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:37:24,756][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:37:25,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." lines elided (32 in total for this update) ...]
[2025-08-20 19:37:50,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:37:51,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:37:52,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:37:52,916][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:37:54,234][__main__][INFO] - Iteration 739 took 55s (37.62% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 51m 27s. Estimated total time: 15h 18m 50s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 53s, 500 more iterations: 7h 39m 25s.
[2025-08-20 19:37:54,235][__main__][INFO] - Starting iteration 739.
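Annotation: each iteration accumulates the policy-gradient loss over 32 micro-batches of exactly 8960 tokens before a single "Apply reinforce step". This is plain gradient accumulation: backward() runs once per micro-batch with a scaled loss, and the optimizer steps once at the end. A sketch under those assumptions (the micro-batch fields and the log_probs helper are hypothetical stand-ins, not mllm's actual API):

    import torch

    def apply_reinforce_step(policy, optimizer, micro_batches, logger):
        """Accumulate a REINFORCE-style loss over fixed-size token micro-batches,
        then take one optimizer step."""
        optimizer.zero_grad(set_to_none=True)
        n = len(micro_batches)  # 32 in this run
        for mb in micro_batches:
            logp = policy.log_probs(mb.tokens)                 # (n_tokens,), hypothetical helper
            loss = -(mb.advantages * logp).sum() / mb.n_tokens
            (loss / n).backward()                              # scale so gradients average over micro-batches
            logger.info(f"Accumulated the policy gradient loss for {mb.n_tokens} tokens.")
        optimizer.step()

Gradient accumulation is also why per-step ΔVRAM stays near zero: only one micro-batch's activations are alive at a time, while 32 x 8960 = 286,720 tokens contribute to each update.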
[2025-08-20 19:38:17,416][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:38:17,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:38:17,423][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:38:19,883][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:38:19,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:38:19,891][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:38:19,893][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:38:19,893][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:38:20,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." lines elided (32 in total for this update) ...]
[2025-08-20 19:38:45,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:38:46,888][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:38:47,878][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:38:47,880][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:38:49,260][__main__][INFO] - Iteration 740 took 55s (37.66% Gen, 62.34% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 48m 46s. Estimated total time: 15h 17m 4s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 42s, 500 more iterations: 7h 38m 32s.
[2025-08-20 19:38:49,262][__main__][INFO] - Starting iteration 740.
[2025-08-20 19:39:12,527][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:39:12,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:39:12,534][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:39:14,983][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:39:14,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:39:14,991][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:39:14,993][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:39:14,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:39:15,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." lines elided (32 in total for this update) ...]
[2025-08-20 19:39:40,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:39:42,120][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:39:42,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:39:42,995][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:39:44,429][__main__][INFO] - Iteration 741 took 55s (37.70% Gen, 62.30% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 50m 13s. Estimated total time: 15h 19m 27s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 56s, 500 more iterations: 7h 39m 43s.
[2025-08-20 19:39:44,431][__main__][INFO] - Starting iteration 741.
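Annotation: the "Sharing advantage alignment data." / "Receiving advantage packets." pair marks the hand-off between the two agents' trainers. Advantage alignment couples each agent's policy-gradient objective to its opponent's advantage estimates, so the trainers have to swap those tensors every iteration before the loss is accumulated. A minimal sketch of such an exchange using in-process queues (the real transport in mllm is not visible in this log and may differ):

    from queue import Queue

    import torch

    def exchange_advantages(own_adv: torch.Tensor, send_q: Queue, recv_q: Queue, logger):
        """Send our advantage estimates to the opponent's trainer and block on theirs."""
        logger.info("Sharing advantage alignment data.")
        send_q.put(own_adv.detach().cpu())   # detach: the opponent must not backprop through us
        logger.info("Receiving advantage packets.")
        return recv_q.get()                  # blocks until the opponent has shared

With two trainer processes instead of threads, the same pattern maps onto torch.multiprocessing queues or a point-to-point send/recv.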
[2025-08-20 19:40:07,551][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:40:07,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:40:07,559][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:40:10,030][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:40:10,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:40:10,038][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:40:10,040][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:40:10,041][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:40:10,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." lines elided (32 in total for this update) ...]
[2025-08-20 19:40:35,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:40:37,105][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:40:38,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:40:38,020][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:40:39,272][__main__][INFO] - Iteration 742 took 54s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 43m 53s. Estimated total time: 15h 14m 1s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 0s.
[2025-08-20 19:40:39,274][__main__][INFO] - Starting iteration 742.
[2025-08-20 19:41:02,801][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:41:02,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:41:02,809][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:41:05,276][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:41:05,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:41:05,284][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:41:05,286][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:41:05,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:41:05,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." lines elided (32 in total for this update) ...]
[2025-08-20 19:41:30,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:41:32,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:41:33,224][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:41:33,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:41:34,539][__main__][INFO] - Iteration 743 took 55s (38.11% Gen, 61.88% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 50m 1s. Estimated total time: 15h 21m 4s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 32s.
[2025-08-20 19:41:34,540][__main__][INFO] - Starting iteration 743.
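Annotation: after every reinforce step, both optimizer state dicts are written to the run directory. Persisting optimizer state (not just adapter weights) is what makes a roughly 15-hour run resumable after preemption without resetting the optimizer's moment estimates. A sketch of the save logic, assuming torch.save on state_dict() (function and argument names are illustrative; the target filenames match the log):

    import os

    import torch

    def save_optimizer_states(policy_opt, critic_opt, trainer_dir, logger):
        """Checkpoint both optimizers into the trainer's output directory."""
        os.makedirs(trainer_dir, exist_ok=True)
        policy_path = os.path.join(trainer_dir, "policy_optimizer_state.pt")
        torch.save(policy_opt.state_dict(), policy_path)
        logger.info(f"Saved main optimizer state to {policy_path}")
        critic_path = os.path.join(trainer_dir, "critic_optimizer_state.pt")
        torch.save(critic_opt.state_dict(), critic_path)
        logger.info(f"Saved critic optimizer state to {critic_path}")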
[2025-08-20 19:41:57,732][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:41:57,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:41:57,740][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:00,207][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:00,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:00,215][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:00,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:42:00,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:42:00,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." lines elided (32 in total for this update) ...]
[2025-08-20 19:42:25,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:42:27,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:42:28,195][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:42:28,196][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:42:29,588][__main__][INFO] - Iteration 744 took 55s (37.69% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 45m 29s. Estimated total time: 15h 17m 27s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 44s, 500 more iterations: 7h 38m 43s.
[2025-08-20 19:42:29,590][__main__][INFO] - Starting iteration 744.
[2025-08-20 19:42:53,389][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:53,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:53,396][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:55,830][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:55,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:55,838][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:42:55,840][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:42:55,841][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:42:56,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." lines elided (32 in total for this update) ...]
[2025-08-20 19:43:21,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:43:22,912][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:43:23,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:43:23,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:43:25,197][__main__][INFO] - Iteration 745 took 55s (38.42% Gen, 61.58% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 53m 52s. Estimated total time: 15h 26m 46s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 40s, 500 more iterations: 7h 43m 23s.
[2025-08-20 19:43:25,199][__main__][INFO] - Starting iteration 745.
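Annotation: the per-iteration summary extrapolates wall-clock estimates from the mean iteration duration: at roughly 55 s per iteration, "10 more iterations: 9m 13s" is simply 10 x 55.3 s. A sketch of that arithmetic (the smoothing window is an assumption; the log does not reveal how many recent iterations feed the average):

    import statistics

    def fmt_duration(seconds: float) -> str:
        """Render seconds in the log's 'Xh Ym Zs' style."""
        s = int(round(seconds))
        h, rem = divmod(s, 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else (f"{m}m {s}s" if m else f"{s}s")

    def eta_summary(done: int, total: int, recent_durations: list) -> str:
        avg = statistics.fmean(recent_durations)  # e.g. a sliding window of iteration times
        horizons = ", ".join(
            f"{n} more iterations: {fmt_duration(avg * n)}" for n in (10, 100, 500)
        )
        return (
            f"Estimated remaining time: {fmt_duration(avg * (total - done))}. "
            f"Estimated total time: {fmt_duration(avg * total)}. "
            f"Time estimates for {horizons}."
        )

The small drift in "Estimated total time" across iterations (15h 14m to 15h 31m here) is consistent with a windowed average reacting to second-scale noise in generation time.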
[2025-08-20 19:43:48,880][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:43:48,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:43:48,888][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:43:51,359][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:43:51,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:43:51,367][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:43:51,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:43:51,370][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:43:51,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." lines elided (32 in total for this update) ...]
[2025-08-20 19:44:16,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:44:18,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:44:19,337][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:44:19,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:44:20,659][__main__][INFO] - Iteration 746 took 55s (38.27% Gen, 61.73% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 50m 29s. Estimated total time: 15h 24m 19s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 25s, 500 more iterations: 7h 42m 9s.
[2025-08-20 19:44:20,660][__main__][INFO] - Starting iteration 746.
[2025-08-20 19:44:43,858][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:44:43,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:44:43,866][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:44:46,299][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:44:46,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:44:46,307][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:44:46,309][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:44:46,310][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:44:46,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... 30 identical "Accumulated the policy gradient loss for 8960 tokens." lines elided (32 in total for this update) ...]
[2025-08-20 19:45:11,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:13,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:45:14,284][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:45:14,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:45:15,619][__main__][INFO] - Iteration 747 took 54s (37.76% Gen, 62.24% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 41m 14s. Estimated total time: 15h 15m 59s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 59s.
[2025-08-20 19:45:15,621][__main__][INFO] - Starting iteration 747.
[2025-08-20 19:45:39,455][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:45:39,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:45:39,463][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:45:41,918][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:45:41,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:45:41,926][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:45:41,928][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:45:41,929][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:45:42,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:43,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:43,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:44,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:45,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:46,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:46,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:47,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:48,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:49,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:50,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:50,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:51,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:52,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:53,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:54,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:54,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:55,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:56,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:57,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:58,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:45:59,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:00,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:00,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:01,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:02,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:03,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:04,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:04,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:05,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:06,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:07,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:08,994][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:46:10,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:46:10,079][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:46:11,507][__main__][INFO] - Iteration 748 took 55s (38.26% Gen, 61.74% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 55m 45s. Estimated total time: 15h 31m 26s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 43s.
[2025-08-20 19:46:11,509][__main__][INFO] - Starting iteration 748.
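The "For task: ..." records bracket each stage with memory and wall-clock deltas. At this point in the run only the reinforce step allocates anything, and its ΔVRAM Allocated of 2.288818359375e-05 GB is exactly 24 KiB (2.288818359375e-05 × 2^30 bytes), i.e. the model footprint is essentially static. A context manager along the following lines could emit records in this shape; the helper name and the exact definition of "Percentage of VRAM taken" are assumptions, since the log only shows the output format.

import time
from contextlib import contextmanager

import torch

@contextmanager
def track_task(name, log=print):
    # Hypothetical sketch of a task tracker matching the "For task:" records.
    gb = 1024 ** 3
    alloc0, reserved0 = torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
    t0 = time.monotonic()
    try:
        yield
    finally:
        dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
        total = torch.cuda.get_device_properties(0).total_memory
        log(f"For task: {name} , "
            f"ΔVRAM Allocated: {(torch.cuda.memory_allocated() - alloc0) / gb} GB, "
            f"ΔVRAM Reserved: {(torch.cuda.memory_reserved() - reserved0) / gb} GB, "
            f"ΔTime: {dt}, "
            f"Percentage of VRAM taken: {100 * torch.cuda.memory_reserved() / total}%,")

Usage would look like: with track_task("Apply reinforce step", logger.info): optimizer.step().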
[2025-08-20 19:46:34,804][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:46:34,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:46:34,812][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:46:37,272][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:46:37,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:46:37,279][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:46:37,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:46:37,282][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:46:37,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:38,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:39,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:39,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:40,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:41,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:42,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:43,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:43,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:44,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:45,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:46,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:47,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:47,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:48,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:49,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:50,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:51,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:51,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:52,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:53,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:54,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:55,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:56,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:57,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:57,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:58,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:46:59,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:00,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:01,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:01,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:02,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:04,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:47:05,275][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:47:05,276][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:47:06,557][__main__][INFO] - Iteration 749 took 55s (37.84% Gen, 62.16% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 40m 52s. Estimated total time: 15h 17m 27s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 44s, 500 more iterations: 7h 38m 43s.
[2025-08-20 19:47:06,558][__main__][INFO] - Starting iteration 749.
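Between the advantage computation and the policy-gradient accumulation, each trainer shares its advantage estimates and blocks until the peer's arrive ("Sharing advantage alignment data" / "Receiving advantage packets"); the exchange suggests that advantage alignment weights each agent's update using the other agent's advantages as well as its own. The log does not show the transport, so the queue-based exchange below is purely an illustrative assumption, with a hypothetical packet type.

from dataclasses import dataclass
from multiprocessing import Queue

import torch

@dataclass
class AdvantagePacket:
    # Hypothetical payload: per-token advantage estimates for one batch.
    advantages: torch.Tensor

def exchange(mine: AdvantagePacket, send_q: Queue, recv_q: Queue) -> AdvantagePacket:
    send_q.put(mine)      # "Sharing advantage alignment data."
    return recv_q.get()   # "Receiving advantage packets." (blocks until the peer sends)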
[2025-08-20 19:47:30,258][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:47:30,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:47:30,266][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:47:32,756][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:47:32,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:47:32,764][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:47:32,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:47:32,766][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:47:33,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:33,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:34,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:35,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:36,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:37,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:37,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:38,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:39,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:40,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:41,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:41,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:42,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:43,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:44,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:44,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:45,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:46,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:47,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:48,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:48,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:49,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:50,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:51,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:52,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:52,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:54,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:55,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:55,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:56,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:57,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:58,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:47:59,854][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:48:00,819][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:48:00,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:48:02,077][__main__][INFO] - Iteration 750 took 55s (38.26% Gen, 61.74% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 47m 47s. Estimated total time: 15h 25m 18s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 31s, 500 more iterations: 7h 42m 39s.
[2025-08-20 19:48:02,078][__main__][INFO] - Starting iteration 750.
[2025-08-20 19:48:25,245][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:48:25,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:48:25,252][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:48:27,694][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:48:27,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:48:27,702][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:48:27,704][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:48:27,705][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:48:28,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:28,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:29,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:30,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:31,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:31,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:32,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:33,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:34,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:35,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:35,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:36,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:37,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:38,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:39,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:39,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:40,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:41,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:42,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:43,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:44,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:45,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:45,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:46,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:47,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:48,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:49,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:49,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:50,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:51,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:52,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:53,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:48:54,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:48:55,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:48:55,699][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:48:59,770][__main__][INFO] - Iteration 751 took 57s (35.93% Gen, 59.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4h 23m 3s. Estimated total time: 16h 1m 31s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 9s, 500 more iterations: 8h 0m 45s.
[2025-08-20 19:48:59,772][__main__][INFO] - Starting iteration 751.
[2025-08-20 19:49:22,848][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:49:22,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:49:22,855][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:49:25,291][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:49:25,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:49:25,298][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:49:25,301][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:49:25,301][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:49:25,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:26,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:27,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:27,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:28,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:29,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:30,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:31,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:31,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:32,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:33,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:34,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:35,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:35,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:36,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:37,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:38,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:39,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:39,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:40,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:41,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:42,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:43,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:43,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:45,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:45,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:46,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:47,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:48,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:49,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:49,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:50,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:49:52,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:49:53,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:49:53,303][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:49:54,859][__main__][INFO] - Iteration 752 took 55s (37.46% Gen, 62.54% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 38m 42s. Estimated total time: 15h 18m 6s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 3s.
[2025-08-20 19:49:54,860][__main__][INFO] - Starting iteration 752.
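The per-iteration summary lines are internally consistent with a simple running-average model: dividing the printed 500-iteration estimate by 500 recovers the mean iteration time, and the other figures follow. For iteration 752 above, 7h 39m 3s / 500 = 55.086 s, and 1000 × 55.086 s = 15h 18m 6s, exactly the printed total, which suggests the run is scheduled for 1,000 iterations (the remaining-time figure presumably subtracts elapsed wall-clock time). A minimal reproduction, where the truncating formatter is an assumption back-solved from the log:

def fmt(seconds: float) -> str:
    h, rem = divmod(int(seconds), 3600)   # truncate to whole seconds, matching the log
    m, s = divmod(rem, 60)
    return (f"{h}h " if h else "") + f"{m}m {s}s"

avg = (7 * 3600 + 39 * 60 + 3) / 500      # 55.086 s mean iteration time
for n in (10, 100, 500):
    print(f"{n} more iterations: {fmt(n * avg)}")
# -> 9m 10s, 1h 31m 48s, 7h 39m 3s  (matches iteration 752's estimates)
print(f"Estimated total time: {fmt(1000 * avg)}")   # -> 15h 18m 6s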
[2025-08-20 19:50:18,241][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:50:18,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:50:18,248][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:50:20,716][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:50:20,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:50:20,724][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:50:20,726][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:50:20,727][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:50:21,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:21,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:22,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:23,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:24,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:24,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:25,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:26,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:27,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:28,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:28,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:29,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:30,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:31,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:32,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:32,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:33,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:34,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:35,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:36,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:36,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:38,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:38,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:39,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:40,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:41,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:42,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:42,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:43,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:44,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:45,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:46,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:50:47,800][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:50:48,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:50:48,805][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:50:50,038][__main__][INFO] - Iteration 753 took 55s (37.93% Gen, 62.07% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 39m 17s. Estimated total time: 15h 19m 36s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 57s, 500 more iterations: 7h 39m 48s.
[2025-08-20 19:50:50,039][__main__][INFO] - Starting iteration 753.
[2025-08-20 19:51:13,107][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:51:13,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:51:13,115][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:51:15,567][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:51:15,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:51:15,575][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:51:15,577][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:51:15,578][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:51:15,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:16,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:17,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:18,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:19,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:19,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:20,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:21,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:22,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:23,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:23,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:24,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:25,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:26,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:26,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:27,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:28,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:29,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:30,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:30,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:31,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:32,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:33,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:34,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:34,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:35,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:37,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:37,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:38,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:39,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:40,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:41,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:51:42,661][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:51:43,587][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:51:43,588][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:51:44,880][__main__][INFO] - Iteration 754 took 54s (37.62% Gen, 62.37% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 32m 46s. Estimated total time: 15h 14m 0s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 24s, 500 more iterations: 7h 37m 0s.
[2025-08-20 19:51:44,882][__main__][INFO] - Starting iteration 754.
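After every optimizer update, both the policy and critic optimizer states are written back to the agent_trainer directory, presumably so a preempted run can resume without losing the optimizers' step counts and moment estimates (the adapter weights are apparently checkpointed elsewhere, as they do not appear in these records). A minimal sketch of that save, assuming standard torch.optim optimizers; only the file names are taken from the log.

import os

import torch

def save_optimizer_states(policy_opt: torch.optim.Optimizer,
                          critic_opt: torch.optim.Optimizer,
                          out_dir: str) -> None:
    # Persist optimizer internals (step counts, Adam moments) for resumption.
    os.makedirs(out_dir, exist_ok=True)
    torch.save(policy_opt.state_dict(), os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(), os.path.join(out_dir, "critic_optimizer_state.pt"))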
[2025-08-20 19:52:07,892][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:52:07,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:52:07,900][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:52:10,344][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:52:10,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:52:10,351][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:52:10,353][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:52:10,354][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:52:10,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:11,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:12,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:13,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:13,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:14,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:15,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:16,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:16,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:17,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:18,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:19,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:20,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:20,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:21,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:22,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:23,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:24,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:25,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:26,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:27,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:27,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:28,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:29,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:30,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:30,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:31,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:32,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:33,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:34,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:34,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:35,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:52:37,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:52:38,340][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:52:38,342][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:52:39,646][__main__][INFO] - Iteration 755 took 54s (37.58% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 30m 35s. Estimated total time: 15h 12m 44s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 16s, 500 more iterations: 7h 36m 22s.
[2025-08-20 19:52:39,648][__main__][INFO] - Starting iteration 755.
[2025-08-20 19:53:03,220][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:53:03,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:53:03,227][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:53:05,693][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:53:05,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:53:05,701][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 19:53:05,703][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 19:53:05,704][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 19:53:06,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:06,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:07,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:08,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:09,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:09,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:10,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:11,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:12,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:13,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:13,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:14,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:15,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:16,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:17,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:17,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:18,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:19,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:20,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:21,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:21,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:22,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:23,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:24,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:25,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:25,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:27,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:28,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:28,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:29,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:30,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:31,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 19:53:32,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 19:53:33,876][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 19:53:33,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 19:53:35,107][__main__][INFO] - Iteration 756 took 55s (38.06% Gen, 61.94% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 41m 15s. Estimated total time: 15h 24m 19s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 25s, 500 more iterations: 7h 42m 9s.
[2025-08-20 19:53:35,109][__main__][INFO] - Starting iteration 756.
[2025-08-20 19:53:58,176][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:53:58,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:53:58,183][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:00,619][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:00,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:00,626][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:00,629][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:54:00,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:54:00,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:01,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:02,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:03,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:04,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:04,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:05,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:06,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:07,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:08,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:08,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:09,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:10,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:11,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:12,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:12,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:54:13,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:14,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:15,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:16,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:16,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:18,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:18,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:19,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:20,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:21,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:22,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:22,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:23,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:24,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:25,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:26,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:27,700][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:54:28,611][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:54:28,613][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:54:29,948][__main__][INFO] - Iteration 757 took 54s (37.61% Gen, 62.38% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 29m 59s. Estimated total time: 15h 13m 58s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 23s, 500 more iterations: 7h 36m 59s. [2025-08-20 19:54:29,950][__main__][INFO] - Starting iteration 757. 
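Annotation: every iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens" entries, roughly 0.8 s apart, followed by a single "Apply reinforce step". That is the standard gradient-accumulation pattern: 32 microbatches of 8,960 tokens each are backpropagated before one optimizer update. A minimal sketch of that loop, assuming a policy that returns per-token log-probabilities and precomputed advantages (all names are illustrative, not the actual mllm interfaces):

    import torch

    def apply_reinforce_step(policy, optimizer, microbatches, logger):
        """Accumulate an advantage-weighted policy-gradient loss over
        fixed-size microbatches, then take one optimizer step."""
        optimizer.zero_grad(set_to_none=True)
        for mb in microbatches:  # e.g. 32 microbatches, 8,960 tokens each
            logprobs = policy(mb["input_ids"])   # (batch, seq) token log-probs
            mask = mb["loss_mask"]               # 1 on tokens that get a gradient
            pg_loss = -(mb["advantages"] * logprobs * mask).sum() / mask.sum()
            (pg_loss / len(microbatches)).backward()  # scale so grads average
            logger.info("Accumulated the policy gradient loss for %d tokens.",
                        int(mask.sum()))
        optimizer.step()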
[2025-08-20 19:54:53,344][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:53,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:53,352][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:55,800][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:55,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:55,808][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:54:55,809][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:54:55,810][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:54:56,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:56,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:57,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:58,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:54:59,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:00,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:00,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:01,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:02,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:03,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:04,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:04,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:05,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:06,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:07,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:08,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:55:08,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:09,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:10,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:11,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:11,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:12,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:13,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:14,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:15,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:16,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:17,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:18,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:18,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:19,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:20,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:21,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:22,868][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:55:23,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:55:23,809][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:55:25,266][__main__][INFO] - Iteration 758 took 55s (37.84% Gen, 62.16% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 37m 1s. Estimated total time: 15h 21m 56s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 11s, 500 more iterations: 7h 40m 58s. [2025-08-20 19:55:25,268][__main__][INFO] - Starting iteration 758. 
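Annotation: each iteration ends by persisting both optimizer states under agent_trainer/. A plain torch.save of the two state_dicts would reproduce the "Saved main optimizer state ..." / "Saved critic optimizer state ..." pair; a sketch under that assumption (the real persistence code may differ):

    import os

    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
        # Checkpoint both optimizers so a restart can resume momentum/Adam state.
        os.makedirs(out_dir, exist_ok=True)
        for label, opt, fname in (
            ("main", policy_opt, "policy_optimizer_state.pt"),
            ("critic", critic_opt, "critic_optimizer_state.pt"),
        ):
            path = os.path.join(out_dir, fname)
            torch.save(opt.state_dict(), path)
            logger.info("Saved %s optimizer state to %s", label, path)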
[2025-08-20 19:55:48,423][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:55:48,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:55:48,430][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:55:50,887][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:55:50,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:55:50,895][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:55:50,897][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:55:50,898][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:55:51,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:51,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:52,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:53,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:54,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:55,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:55,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:56,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:57,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:58,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:59,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:55:59,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:00,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:01,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:02,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:03,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:56:03,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:04,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:05,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:06,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:07,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:07,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:08,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:09,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:10,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:11,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:12,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:13,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:13,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:14,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:15,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:16,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:17,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:56:19,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:56:19,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:56:20,319][__main__][INFO] - Iteration 759 took 55s (37.64% Gen, 62.35% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 31m 41s. Estimated total time: 15h 17m 30s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 45s. [2025-08-20 19:56:20,320][__main__][INFO] - Starting iteration 759. 
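Annotation: the end-of-iteration summary splits wall time into generation and training and extrapolates the remaining and total run time; the small jitter in "Estimated total time" from one iteration to the next suggests the projection tracks observed iteration durations rather than a fixed constant. A sketch of that bookkeeping using a running mean (the actual estimator and any averaging window are assumptions):

    from dataclasses import dataclass, field

    def fmt(seconds: float) -> str:
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

    @dataclass
    class IterationTimer:
        total_iters: int
        durations: list = field(default_factory=list)

        def report(self, it: int, gen_s: float, train_s: float) -> str:
            total = gen_s + train_s
            self.durations.append(total)
            avg = sum(self.durations) / len(self.durations)
            remaining = avg * (self.total_iters - it)
            return (
                f"Iteration {it} took {total:.0f}s "
                f"({100 * gen_s / total:.2f}% Gen, {100 * train_s / total:.2f}% Train). "
                f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s. "
                f"Estimated remaining time: {fmt(remaining)}. "
                f"Estimated total time: {fmt(avg * self.total_iters)}. "
                f"Time estimates for 10 more iterations: {fmt(10 * avg)}, "
                f"100 more iterations: {fmt(100 * avg)}, "
                f"500 more iterations: {fmt(500 * avg)}."
            )

At ~55 s per iteration, the logged "500 more iterations: 7h 4xm" figures are consistent with this kind of average (500 × 55 s ≈ 7h 38m).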
[2025-08-20 19:56:43,459][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:56:43,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:56:43,467][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:56:45,921][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:56:45,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:56:45,929][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:56:45,931][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:56:45,932][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:56:46,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:47,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:47,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:48,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:49,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:50,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:50,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:51,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:52,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:53,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:54,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:54,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:55,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:56,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:57,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:58,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:56:58,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:56:59,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:00,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:01,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:02,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:03,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:04,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:04,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:05,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:06,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:07,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:08,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:08,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:09,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:10,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:11,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:12,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:57:13,990][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:57:13,992][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:57:15,415][__main__][INFO] - Iteration 760 took 55s (37.56% Gen, 62.44% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 31m 29s. Estimated total time: 15h 18m 14s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 49s, 500 more iterations: 7h 39m 7s. [2025-08-20 19:57:15,416][__main__][INFO] - Starting iteration 760. 
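Annotation: "Sharing advantage alignment data." followed immediately by "Receiving advantage packets." points to a symmetric exchange in which each agent's trainer ships its advantage estimates to its opponent's trainer, since an advantage-alignment update weights a policy's log-probabilities by both players' advantages. A toy sketch of such an exchange over in-process queues (the real transport and packet format are unknown):

    import queue

    import torch

    def exchange_advantages(my_adv: torch.Tensor,
                            outbox: queue.Queue,
                            inbox: queue.Queue,
                            logger) -> torch.Tensor:
        """Send this agent's per-step advantages; block on the opponent's."""
        logger.info("Sharing advantage alignment data.")
        outbox.put(my_adv.detach().cpu())
        logger.info("Receiving advantage packets.")
        return inbox.get()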
[2025-08-20 19:57:38,878][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:57:38,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:57:38,885][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:57:41,335][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:57:41,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:57:41,343][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:57:41,345][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:57:41,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:57:41,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:42,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:43,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:44,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:44,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:45,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:46,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:47,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:47,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:48,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:49,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:50,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:51,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:51,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:52,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:53,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:57:54,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:55,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:56,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:57,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:58,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:58,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:57:59,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:00,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:01,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:02,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:02,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:03,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:04,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:05,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:05,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:06,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:08,407][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:58:09,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:58:09,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:58:10,699][__main__][INFO] - Iteration 761 took 55s (38.03% Gen, 61.97% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 33m 42s. Estimated total time: 15h 21m 21s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 8s, 500 more iterations: 7h 40m 40s. [2025-08-20 19:58:10,700][__main__][INFO] - Starting iteration 761. 
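Annotation, throughput check from the entries above (assuming each "Accumulated ... 8960 tokens" line counts distinct tokens): 32 microbatches × 8,960 tokens = 286,720 tokens per reinforce step, processed in the ~34 s training phase (≈8,400 tokens/s), or ≈5,200 tokens/s averaged over the full ~55 s iteration including generation.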
[2025-08-20 19:58:34,289][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:58:34,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:58:34,297][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:58:36,740][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:58:36,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:58:36,748][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:58:36,750][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:58:36,750][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:58:37,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:37,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:38,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:39,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:40,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:41,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:41,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:42,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:43,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:44,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:44,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:45,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:46,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:47,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:48,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:48,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:58:49,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:51,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:51,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:52,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:53,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:54,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:55,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:55,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:56,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:57,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:58,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:59,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:58:59,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:00,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:01,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:02,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:03,864][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:59:04,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:59:04,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 19:59:06,045][__main__][INFO] - Iteration 762 took 55s (38.21% Gen, 61.79% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 33m 49s. Estimated total time: 15h 22m 24s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 14s, 500 more iterations: 7h 41m 12s. [2025-08-20 19:59:06,046][__main__][INFO] - Starting iteration 762. 
[2025-08-20 19:59:29,424][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:59:29,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:59:29,431][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:59:31,865][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:59:31,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:59:31,872][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 19:59:31,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 19:59:31,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 19:59:32,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:32,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:33,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:34,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:35,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:36,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:36,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:37,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:38,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:39,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:40,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:40,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:41,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:42,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:43,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:44,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 19:59:44,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:46,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:46,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:47,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:48,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:49,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:50,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:50,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:51,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:52,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:53,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:54,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:54,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:55,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:56,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:57,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 19:59:58,863][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 19:59:59,968][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 19:59:59,970][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:00:01,313][__main__][INFO] - Iteration 763 took 55s (37.90% Gen, 62.10% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 31m 35s. Estimated total time: 15h 21m 6s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 33s. [2025-08-20 20:00:01,314][__main__][INFO] - Starting iteration 763. 
[2025-08-20 20:00:25,007][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:00:25,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:00:25,014][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:00:27,463][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:00:27,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:00:27,471][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:00:27,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:00:27,474][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:00:27,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:28,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:29,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:30,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:30,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:31,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:32,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:33,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:34,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:34,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:35,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:36,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:37,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:38,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:38,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:39,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:00:40,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:41,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:42,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:42,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:43,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:44,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:45,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:46,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:46,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:47,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:48,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:49,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:50,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:51,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:52,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:52,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:00:54,493][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:00:55,480][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:00:55,481][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:00:56,985][__main__][INFO] - Iteration 764 took 55s (38.16% Gen, 61.84% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 37m 24s. Estimated total time: 15h 27m 50s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 55s. [2025-08-20 20:00:56,987][__main__][INFO] - Starting iteration 764. 
[2025-08-20 20:01:20,099][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:01:20,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:01:20,106][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:01:22,574][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:01:22,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:01:22,582][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:01:22,584][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:01:22,585][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:01:22,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:23,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:24,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:25,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:26,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:26,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:27,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:28,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:29,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:30,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:30,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:31,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:32,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:33,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:34,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:34,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:01:35,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:36,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:37,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:37,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:38,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:39,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:40,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:41,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:42,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:43,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:44,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:44,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:45,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:46,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:47,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:48,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:01:49,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:01:50,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:01:50,580][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:01:51,949][__main__][INFO] - Iteration 765 took 54s (37.58% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 24m 40s. Estimated total time: 15h 16m 1s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 0s. [2025-08-20 20:01:51,950][__main__][INFO] - Starting iteration 765. 
[2025-08-20 20:02:15,395][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:02:15,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:02:15,402][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:02:17,848][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:02:17,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:02:17,856][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:02:17,858][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:02:17,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:02:18,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:18,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:19,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:20,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:21,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:22,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:22,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:23,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:24,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:25,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:26,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:26,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:27,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:28,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:29,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:30,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:30,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:31,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:32,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:33,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:34,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:34,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:35,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:36,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:37,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:38,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:39,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:40,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:40,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:41,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:42,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:43,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:02:44,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:02:45,743][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:02:45,744][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:02:47,105][__main__][INFO] - Iteration 766 took 55s (38.08% Gen, 61.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 26m 58s. Estimated total time: 15h 19m 14s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 37s.
[2025-08-20 20:02:47,107][__main__][INFO] - Starting iteration 766.
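Note: each iteration accumulates the policy-gradient loss over 32 microbatches of 8,960 tokens before the single "Apply reinforce step" update. A minimal sketch of that accumulate-then-step pattern (the batch layout and policy.action_logprobs are hypothetical; the actual loss comes from the advantage-alignment trainer, not this plain surrogate):

    import torch

    def apply_reinforce_step(policy, optimizer, microbatches, logger):
        # Accumulate gradients over fixed-size token chunks, then take one step.
        optimizer.zero_grad(set_to_none=True)
        for batch in microbatches:  # e.g. 32 chunks of 8960 tokens each
            # Token-level policy-gradient surrogate: -(advantage * logprob),
            # normalized per token so chunks contribute comparably.
            logprobs = policy.action_logprobs(batch["input_ids"], batch["mask"])
            loss = -(batch["advantages"] * logprobs).sum() / batch["num_tokens"]
            (loss / len(microbatches)).backward()  # average over accumulation steps
            logger.info("Accumulated the policy gradient loss for %d tokens.",
                        batch["num_tokens"])
        optimizer.step()  # corresponds to the "Apply reinforce step" task above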
[2025-08-20 20:03:10,138][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:03:10,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:03:10,145][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:03:12,613][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:03:12,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:03:12,621][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:03:12,623][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:03:12,624][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:03:12,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:13,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:14,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:15,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:16,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:16,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:17,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:18,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:19,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:20,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:20,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:21,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:22,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:23,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:24,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:24,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:25,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:26,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:27,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:28,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:28,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:29,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:30,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:31,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:32,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:33,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:34,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:34,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:35,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:36,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:37,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:38,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:03:39,683][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:03:40,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:03:40,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:03:42,063][__main__][INFO] - Iteration 767 took 54s (37.46% Gen, 62.54% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 22m 44s. Estimated total time: 15h 15m 55s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 57s.
[2025-08-20 20:03:42,064][__main__][INFO] - Starting iteration 767.
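Note: the two "Saved ... optimizer state" entries persist the policy and critic optimizers after every step, so a preempted run can resume with its step counts and Adam-style moment estimates intact. A sketch of the likely mechanism, a plain torch.save of each optimizer's state_dict (the function name is hypothetical; the file names match the log):

    import os
    import torch

    def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir, logger):
        # state_dict() captures step counts and moment estimates, so a resumed
        # run continues with the same effective update dynamics.
        os.makedirs(out_dir, exist_ok=True)
        policy_path = os.path.join(out_dir, "policy_optimizer_state.pt")
        critic_path = os.path.join(out_dir, "critic_optimizer_state.pt")
        torch.save(policy_optimizer.state_dict(), policy_path)
        logger.info("Saved main optimizer state to %s", policy_path)
        torch.save(critic_optimizer.state_dict(), critic_path)
        logger.info("Saved critic optimizer state to %s", critic_path)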
[2025-08-20 20:04:05,533][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:04:05,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:04:05,541][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:04:07,991][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:04:07,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:04:07,999][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:04:08,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:04:08,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:04:08,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:09,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:09,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:10,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:11,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:12,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:13,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:13,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:14,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:15,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:16,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:17,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:17,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:18,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:19,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:20,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:21,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:21,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:22,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:23,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:24,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:25,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:25,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:26,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:27,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:28,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:28,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:30,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:31,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:31,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:32,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:33,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:04:35,140][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:04:36,093][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:04:36,094][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:04:37,425][__main__][INFO] - Iteration 768 took 55s (38.00% Gen, 62.00% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 28m 34s. Estimated total time: 15h 22m 40s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 16s, 500 more iterations: 7h 41m 20s.
[2025-08-20 20:04:37,427][__main__][INFO] - Starting iteration 768.
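Note: the per-iteration summary projects remaining time from completed iterations and extrapolates to 10/100/500 more. A sketch reproducing the format of those lines, assuming a simple mean over past durations (the run's actual smoothing scheme is not visible in the log):

    def eta_summary(iteration, total_iterations, durations):
        # durations: wall-clock seconds of each completed iteration.
        avg = sum(durations) / len(durations)

        def fmt(seconds):
            h, rem = divmod(int(seconds), 3600)
            m, s = divmod(rem, 60)
            return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

        return (
            f"Estimated remaining time: {fmt(avg * (total_iterations - iteration))}. "
            f"Estimated total time: {fmt(avg * total_iterations)}. "
            "Time estimates for "
            + ", ".join(f"{n} more iterations: {fmt(avg * n)}" for n in (10, 100, 500))
            + "."
        )

At roughly 55s per iteration this yields about 9m for 10 more and about 7h 38m for 500 more, consistent with the figures above.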
[2025-08-20 20:05:00,469][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:00,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:00,477][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:02,942][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:02,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:02,950][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:02,952][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:05:02,953][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:05:03,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:04,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:04,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:05,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:06,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:07,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:07,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:08,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:09,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:10,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:11,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:11,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:12,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:13,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:14,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:15,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:15,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:16,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:17,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:18,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:19,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:20,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:21,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:22,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:22,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:23,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:24,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:25,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:25,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:26,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:27,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:28,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:29,951][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:05:30,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:05:30,883][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:05:32,252][__main__][INFO] - Iteration 769 took 54s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 18m 43s. Estimated total time: 15h 13m 44s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 22s, 500 more iterations: 7h 36m 52s.
[2025-08-20 20:05:32,254][__main__][INFO] - Starting iteration 769.
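Note: "Sharing advantage alignment data" followed by "Receiving advantage packets" indicates the two co-training agents exchange advantage estimates, which advantage alignment needs so each trainer can weight its policy gradient by the other agent's advantages. The actual transport ("packets") is not shown in the log; a sketch of one plausible implementation using torch.distributed point-to-point ops, assuming an initialized process group and identically shaped tensors on both sides:

    import torch
    import torch.distributed as dist

    def exchange_advantages(my_advantages, peer_rank):
        # Both trainers post a send and a receive together so neither blocks
        # the other; batch_isend_irecv issues the pair atomically.
        peer_advantages = torch.empty_like(my_advantages)
        ops = [
            dist.P2POp(dist.isend, my_advantages, peer_rank),
            dist.P2POp(dist.irecv, peer_advantages, peer_rank),
        ]
        for work in dist.batch_isend_irecv(ops):
            work.wait()
        return peer_advantages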
[2025-08-20 20:05:55,365][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:55,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:55,372][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:57,823][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:57,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:57,831][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:05:57,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:05:57,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:05:58,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:58,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:05:59,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:00,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:01,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:02,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:02,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:03,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:04,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:05,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:06,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:06,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:07,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:08,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:09,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:10,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:10,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:11,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:12,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:13,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:13,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:14,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:15,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:16,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:17,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:17,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:18,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:20,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:20,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:21,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:22,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:23,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:24,827][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:06:25,751][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:06:25,752][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:06:27,065][__main__][INFO] - Iteration 770 took 54s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 17m 35s. Estimated total time: 15h 13m 31s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 21s, 500 more iterations: 7h 36m 45s.
[2025-08-20 20:06:27,067][__main__][INFO] - Starting iteration 770.
[2025-08-20 20:06:50,149][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:06:50,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:06:50,156][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:06:52,575][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:06:52,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:06:52,583][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:06:52,585][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:06:52,586][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:06:52,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:53,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:54,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:55,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:56,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:56,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:57,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:58,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:06:59,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:00,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:00,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:01,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:02,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:03,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:04,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:04,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:05,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:06,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:07,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:07,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:08,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:10,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:10,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:11,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:12,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:13,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:14,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:14,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:15,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:16,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:17,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:18,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:19,746][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:07:20,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:07:20,706][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:07:21,997][__main__][INFO] - Iteration 771 took 54s (37.60% Gen, 62.39% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 18m 39s. Estimated total time: 15h 15m 30s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 45s.
[2025-08-20 20:07:21,999][__main__][INFO] - Starting iteration 771.
[2025-08-20 20:07:45,403][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:07:45,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:07:45,410][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:07:47,877][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:07:47,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:07:47,885][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:07:47,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:07:47,889][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:07:48,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:48,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:49,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:50,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:51,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:52,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:52,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:53,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:54,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:55,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:56,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:56,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:57,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:58,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:07:59,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:00,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:00,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:01,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:02,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:03,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:04,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:04,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:05,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:06,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:07,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:08,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:09,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:10,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:10,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:11,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:12,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:13,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:14,881][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:08:15,941][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:08:15,943][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:08:17,475][__main__][INFO] - Iteration 772 took 55s (37.78% Gen, 62.22% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 26m 49s. Estimated total time: 15h 24m 35s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 27s, 500 more iterations: 7h 42m 17s.
[2025-08-20 20:08:17,476][__main__][INFO] - Starting iteration 772.
[2025-08-20 20:08:40,502][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:08:40,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:08:40,510][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:08:42,979][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:08:42,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:08:42,987][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:08:42,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:08:42,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:08:43,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:44,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:44,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:45,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:46,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:47,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:48,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:48,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:49,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:50,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:51,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:52,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:52,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:53,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:54,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:55,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:55,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:57,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:58,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:58,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:08:59,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:00,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:01,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:02,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:02,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:03,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:04,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:05,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:06,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:06,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:07,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:08,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:10,041][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:09:10,968][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:09:10,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:09:12,297][__main__][INFO] - Iteration 773 took 54s (37.56% Gen, 62.44% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 14m 59s. Estimated total time: 15h 13m 40s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 22s, 500 more iterations: 7h 36m 50s.
[2025-08-20 20:09:12,298][__main__][INFO] - Starting iteration 773.
[2025-08-20 20:09:35,331][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:09:35,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:09:35,338][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:09:37,783][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:09:37,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:09:37,791][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:09:37,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:09:37,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:09:38,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:38,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:39,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:40,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:41,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:42,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:42,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:43,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:44,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:45,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:46,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:46,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:47,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:48,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:49,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:50,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:50,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:51,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:52,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:53,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:54,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:55,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:56,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:56,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:57,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:58,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:09:59,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:10:00,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:10:00,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:10:01,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:10:02,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:10:03,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:10:04,873][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:10:05,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:10:05,836][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:10:08,195][__main__][INFO] - Iteration 774 took 55s (36.84% Gen, 63.16% Train). Generation: 20s, Training: 35s. Estimated remaining time: 3h 31m 58s. Estimated total time: 15h 31m 36s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 48s.
[2025-08-20 20:10:08,196][__main__][INFO] - Starting iteration 774.
[2025-08-20 20:10:31,689][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:10:31,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:10:31,697][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:10:34,145][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:10:34,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:10:34,152][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:10:34,155][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:10:34,156][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:10:34,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:35,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:36,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:36,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:37,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:38,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:39,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:40,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:40,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:41,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:42,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:43,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:43,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:44,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:45,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:46,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:10:47,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:47,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:48,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:49,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:50,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:51,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:52,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:53,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:54,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:54,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:55,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:56,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:57,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:58,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:58,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:10:59,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:01,183][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:11:02,088][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:11:02,090][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:11:03,463][__main__][INFO] - Iteration 775 took 55s (38.08% Gen, 61.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 20m 33s. Estimated total time: 15h 21m 6s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 33s. [2025-08-20 20:11:03,465][__main__][INFO] - Starting iteration 775. 
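Annotation. Before any loss is accumulated, each iteration logs a "Sharing advantage alignment data" / "Receiving advantage packets" handshake, so the two trainers evidently exchange advantage estimates with each other before computing the policy loss. The log shows only that the exchange happens, not its transport or schema; the sketch below is a guess at the shape of such an exchange, assuming in-process trainers connected by queues, with AdvantagePacket and both queue arguments invented for illustration.

    import queue
    from dataclasses import dataclass
    import torch

    @dataclass
    class AdvantagePacket:
        """Hypothetical container: the log names 'advantage packets'
        but never shows their fields."""
        agent_id: str
        iteration: int
        advantages: torch.Tensor  # e.g. [num_trajectories, num_steps]

    def share_and_receive(own: AdvantagePacket,
                          outbox: "queue.Queue[AdvantagePacket]",
                          inbox: "queue.Queue[AdvantagePacket]",
                          timeout: float = 60.0) -> AdvantagePacket:
        # Publish this trainer's advantages, then block until the peer's arrive.
        outbox.put(own)
        peer = inbox.get(timeout=timeout)
        assert peer.iteration == own.iteration, "trainers out of sync"
        return peer

In advantage-alignment-style training the peer's advantages then enter the policy-gradient weights, so actions that help both agents are reinforced; the precise weighting is algorithm-specific and not visible in this log.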
[2025-08-20 20:11:26,718][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:11:26,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:11:26,726][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:11:29,192][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:11:29,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:11:29,200][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:11:29,202][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:11:29,202][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:11:29,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:30,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:31,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:31,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:32,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:33,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:34,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:35,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:35,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:36,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:37,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:38,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:39,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:39,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:40,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:41,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:11:42,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:43,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:44,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:45,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:45,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:46,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:47,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:48,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:49,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:49,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:50,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:51,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:52,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:53,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:53,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:54,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:11:56,272][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:11:57,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:11:57,184][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:11:58,445][__main__][INFO] - Iteration 776 took 54s (37.82% Gen, 62.18% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 14m 52s. Estimated total time: 15h 16m 19s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 9s. [2025-08-20 20:11:58,446][__main__][INFO] - Starting iteration 776. 
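Annotation. Every "For task: ..." record reports ΔVRAM Allocated, ΔVRAM Reserved, ΔTime, and a percentage of total VRAM. Two details are recoverable from the numbers themselves: the recurring ΔVRAM Allocated of 2.288818359375e-05 GB is exactly 24,576 bytes (24 KiB, so the reported "GB" behaves like GiB, i.e. division by 2^30), and ΔTime is formatted at whole-second resolution, which is why sub-second tasks print 00:00:00. The sketch below is a context manager in that spirit, assuming torch.cuda memory counters; whether the real logger bases "Percentage of VRAM taken" on reserved or allocated memory cannot be determined from the log, and all names here are hypothetical.

    import time
    import logging
    from contextlib import contextmanager
    import torch

    log = logging.getLogger("task_profiler")

    @contextmanager
    def profile_task(name: str, device: int = 0):
        gib = 1024 ** 3                      # matches the 24 KiB = 2.288818e-05 "GB" reading
        alloc0 = torch.cuda.memory_allocated(device)
        resv0 = torch.cuda.memory_reserved(device)
        t0 = time.monotonic()
        try:
            yield
        finally:
            dt = time.monotonic() - t0
            d_alloc = (torch.cuda.memory_allocated(device) - alloc0) / gib
            d_resv = (torch.cuda.memory_reserved(device) - resv0) / gib
            total = torch.cuda.get_device_properties(device).total_memory
            # Assumption: percentage from reserved memory; the log does not say.
            pct = 100.0 * torch.cuda.memory_reserved(device) / total
            log.info(
                "For task: %s , ΔVRAM Allocated: %s GB, ΔVRAM Reserved: %s GB, "
                "ΔTime: %s, Percentage of VRAM taken: %s%%,",
                name, d_alloc, d_resv,
                time.strftime("%H:%M:%S", time.gmtime(dt)), pct,
            )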
[2025-08-20 20:12:22,079][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:12:22,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:12:22,087][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:12:24,540][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:12:24,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:12:24,548][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:12:24,550][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:12:24,550][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:12:24,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:25,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:26,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:27,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:28,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:28,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:29,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:30,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:31,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:32,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:32,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:33,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:34,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:35,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:35,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:36,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:12:37,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:38,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:39,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:39,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:40,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:41,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:42,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:43,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:44,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:45,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:45,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:46,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:47,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:48,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:49,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:49,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:12:51,564][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:12:52,533][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:12:52,534][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:12:54,290][__main__][INFO] - Iteration 777 took 55s (37.93% Gen, 62.07% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 28m 20s. Estimated total time: 15h 30m 43s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 4s, 500 more iterations: 7h 45m 21s. [2025-08-20 20:12:54,292][__main__][INFO] - Starting iteration 777. 
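Annotation. The iteration summaries are internally consistent simple arithmetic. Taking the iteration-777 summary above: "10 more iterations: 9m 18s" is 558 s, "100 more iterations: 1h 33m 4s" is 5,584 s, and "500 more iterations: 7h 45m 21s" is 27,921 s, all multiples of the same ~55.84 s per-iteration running average; the estimated total of 15h 30m 43s (55,843 s) is 1000 times that average, and the remaining estimate of 3h 28m 20s is ~223 iterations' worth, so the numbers are consistent with a 1000-iteration run nearing iteration 777. The helper below just reproduces that arithmetic; total=1000 is inferred from the summaries, not read from any config, and the real logger may smooth the average differently (the remaining-time figures drift a few seconds from the naive product).

    def eta_report(avg_iter_s: float, done: int, total: int = 1000):
        """Reproduce the log's time estimates from a per-iteration average."""
        def hms(s: float) -> str:
            s = int(round(s))
            return f"{s // 3600}h {s % 3600 // 60}m {s % 60}s"
        remaining = total - done
        return {
            "remaining": hms(avg_iter_s * remaining),
            "total":     hms(avg_iter_s * total),
            "per_10":    hms(avg_iter_s * 10),
            "per_100":   hms(avg_iter_s * 100),
            "per_500":   hms(avg_iter_s * 500),
        }

    # eta_report(55.84, done=777) -> per_10 '9m 18s', per_500 '7h 45m 20s',
    # total '15h 30m 40s', matching the iteration-777 summary to rounding.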
[2025-08-20 20:13:18,601][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:13:18,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:13:18,609][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:13:21,074][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:13:21,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:13:21,082][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:13:21,084][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:13:21,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:13:21,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:22,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:22,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:23,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:24,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:25,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:26,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:26,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:27,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:28,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:29,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:30,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:30,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:31,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:32,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:33,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:13:34,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:34,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:35,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:36,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:37,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:38,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:38,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:39,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:40,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:41,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:42,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:43,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:44,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:44,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:45,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:46,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:13:48,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:13:48,987][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:13:48,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:13:50,321][__main__][INFO] - Iteration 778 took 56s (38.99% Gen, 61.01% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 30m 29s. Estimated total time: 15h 33m 48s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 22s, 500 more iterations: 7h 46m 54s. [2025-08-20 20:13:50,322][__main__][INFO] - Starting iteration 778. 
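Annotation. After every reinforce step the trainer persists both optimizer states (policy and critic) to fixed paths under .../agent_trainer/, about one save per minute, which lets a preempted job resume mid-run with optimizer momentum intact. A minimal sketch of that persistence is below, assuming standard torch optimizers; the temp-file-plus-rename step is an added safeguard of this sketch, not something the log shows the trainer doing.

    import os
    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir):
        """Persist optimizer state after each update (paths mirror the log)."""
        os.makedirs(out_dir, exist_ok=True)
        for name, opt in (("policy_optimizer_state.pt", policy_opt),
                          ("critic_optimizer_state.pt", critic_opt)):
            path = os.path.join(out_dir, name)
            tmp = path + ".tmp"
            torch.save(opt.state_dict(), tmp)
            os.replace(tmp, path)  # atomic on POSIX: no torn file on preemption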
[2025-08-20 20:14:13,547][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:14:13,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:14:13,556][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:14:16,054][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:14:16,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:14:16,062][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:14:16,065][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:14:16,065][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:14:16,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:17,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:17,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:18,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:19,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:20,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:21,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:21,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:22,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:23,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:24,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:25,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:25,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:26,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:27,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:28,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:14:29,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:29,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:30,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:31,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:32,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:33,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:34,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:35,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:36,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:36,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:37,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:38,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:39,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:39,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:40,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:41,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:14:43,156][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:14:44,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:14:44,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:14:45,373][__main__][INFO] - Iteration 779 took 55s (37.64% Gen, 62.36% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 13m 15s. Estimated total time: 15h 17m 30s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 45s. [2025-08-20 20:14:45,375][__main__][INFO] - Starting iteration 779. 
[2025-08-20 20:15:08,751][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:15:08,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:15:08,759][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:15:11,194][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:15:11,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:15:11,202][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:15:11,204][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:15:11,205][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:15:11,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:12,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:13,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:13,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:14,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:15,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:16,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:17,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:17,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:18,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:19,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:20,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:21,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:21,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:22,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:23,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:15:24,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:25,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:25,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:26,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:27,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:28,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:29,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:30,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:31,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:31,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:32,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:33,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:34,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:34,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:35,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:36,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:15:38,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:15:39,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:15:39,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:15:40,636][__main__][INFO] - Iteration 780 took 55s (37.90% Gen, 62.10% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 15m 51s. Estimated total time: 15h 21m 1s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 6s, 500 more iterations: 7h 40m 30s. [2025-08-20 20:15:40,638][__main__][INFO] - Starting iteration 780. 
[2025-08-20 20:16:04,162][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:16:04,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:16:04,170][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:16:06,643][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:16:06,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:16:06,651][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:16:06,654][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:16:06,654][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:16:06,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:07,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:08,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:09,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:10,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:10,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:11,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:12,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:13,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:14,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:14,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:15,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:16,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:17,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:18,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:18,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:16:19,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:20,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:21,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:22,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:22,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:23,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:24,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:25,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:26,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:26,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:27,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:28,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:29,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:30,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:31,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:32,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:16:33,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:16:34,628][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:16:34,630][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:16:36,035][__main__][INFO] - Iteration 781 took 55s (38.04% Gen, 61.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 17m 11s. Estimated total time: 15h 23m 16s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 38s. [2025-08-20 20:16:36,037][__main__][INFO] - Starting iteration 781. 
[2025-08-20 20:16:59,612][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:16:59,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:16:59,621][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:02,080][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:02,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:02,087][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:02,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:17:02,090][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:17:02,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:03,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:03,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:04,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:05,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:06,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:07,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:07,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:08,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:09,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:10,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:11,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:11,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:12,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:13,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:14,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:17:15,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:15,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:16,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:17,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:18,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:19,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:20,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:21,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:21,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:22,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:23,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:24,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:25,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:25,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:26,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:27,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:29,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:17:30,578][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:17:30,580][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:17:31,846][__main__][INFO] - Iteration 782 took 55s (37.86% Gen, 62.14% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 23m 8s. Estimated total time: 15h 30m 9s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 4s. [2025-08-20 20:17:31,847][__main__][INFO] - Starting iteration 782. 
[2025-08-20 20:17:55,400][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:55,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:55,408][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:57,858][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:57,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:57,866][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:17:57,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:17:57,869][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:17:58,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:58,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:17:59,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:00,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:01,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:02,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:02,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:03,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:04,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:05,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:06,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:06,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:07,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:08,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:09,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:10,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:18:10,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:11,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:12,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:13,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:14,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:15,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:16,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:16,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:17,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:18,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:19,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:20,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:20,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:21,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:22,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:23,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:18:24,882][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:18:25,788][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:18:25,790][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:18:27,129][__main__][INFO] - Iteration 783 took 55s (38.16% Gen, 61.84% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 13m 25s. Estimated total time: 15h 21m 21s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 8s, 500 more iterations: 7h 40m 40s. [2025-08-20 20:18:27,131][__main__][INFO] - Starting iteration 783. 
[2025-08-20 20:18:50,228][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:18:50,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:18:50,236][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:18:52,696][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:18:52,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:18:52,704][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:18:52,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:18:52,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:18:53,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [identical record repeated 32× in total, ~0.8 s apart, last at 2025-08-20 20:19:18,085]
[2025-08-20 20:19:19,652][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:19:20,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:19:20,576][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:19:22,019][__main__][INFO] - Iteration 784 took 54s (37.58% Gen, 62.42% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 5m 56s. Estimated total time: 15h 14m 47s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 23s.
[2025-08-20 20:19:22,021][__main__][INFO] - Starting iteration 784.
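The "For task: ... , ΔVRAM Allocated: ... , ΔVRAM Reserved: ... , ΔTime: ..." records suggest each named task runs inside a small profiling wrapper. A sketch of how such deltas can be measured with the standard torch.cuda counters; the exact basis of "Percentage of VRAM taken" is not visible in the log, so reserved memory over device total is an assumption here:

    import time
    from contextlib import contextmanager

    import torch

    @contextmanager
    def track_task(name, device=0):
        """Log per-task VRAM and wall-clock deltas (message layout mirrors the log)."""
        alloc0 = torch.cuda.memory_allocated(device)
        reserved0 = torch.cuda.memory_reserved(device)
        t0 = time.monotonic()
        yield
        d_alloc = (torch.cuda.memory_allocated(device) - alloc0) / 2**30
        d_reserved = (torch.cuda.memory_reserved(device) - reserved0) / 2**30
        elapsed = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
        total = torch.cuda.get_device_properties(device).total_memory
        pct = 100.0 * torch.cuda.memory_reserved(device) / total  # basis assumed
        print(f"For task: {name} , ΔVRAM Allocated: {d_alloc} GB, "
              f"ΔVRAM Reserved: {d_reserved} GB, ΔTime: {elapsed}, "
              f"Percentage of VRAM taken: {pct}%,")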
[2025-08-20 20:19:45,157][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:19:45,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:19:45,164][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:19:47,635][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:19:47,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:19:47,643][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:19:47,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:19:47,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:19:47,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [identical record repeated 32× in total, ~0.8 s apart, last at 2025-08-20 20:20:13,203]
[2025-08-20 20:20:14,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:20:15,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:20:15,700][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:20:16,957][__main__][INFO] - Iteration 785 took 54s (37.63% Gen, 62.37% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 5m 49s. Estimated total time: 15h 15m 35s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 47s.
[2025-08-20 20:20:16,958][__main__][INFO] - Starting iteration 785.
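The pair of "Saved main optimizer state" / "Saved critic optimizer state" records after every step implies both optimizer states are checkpointed each iteration so a preempted run can resume. A minimal sketch with torch.save/torch.load; the resume helper is an assumption, since only the save side appears in the log:

    import os

    import torch

    def save_optimizer_state(optimizer, path):
        """Persist the optimizer state dict each iteration (as logged above)."""
        os.makedirs(os.path.dirname(path), exist_ok=True)
        torch.save(optimizer.state_dict(), path)

    def maybe_resume_optimizer(optimizer, path):
        """Hypothetical resume counterpart: restore state if a checkpoint exists."""
        if os.path.exists(path):
            optimizer.load_state_dict(torch.load(path, map_location="cpu"))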
[2025-08-20 20:20:40,544][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:20:40,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:20:40,552][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:20:42,997][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:20:42,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:20:43,005][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:20:43,007][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:20:43,007][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:20:43,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [identical record repeated 32× in total, ~0.8 s apart, last at 2025-08-20 20:21:08,350]
[2025-08-20 20:21:09,950][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:21:11,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:21:11,458][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:21:12,839][__main__][INFO] - Iteration 786 took 55s (37.84% Gen, 62.16% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 20m 38s. Estimated total time: 15h 31m 20s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 8s, 500 more iterations: 7h 45m 40s.
[2025-08-20 20:21:12,841][__main__][INFO] - Starting iteration 786.
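Each per-iteration summary projects remaining wall-clock time and fixed horizons ("10 more iterations", "100 more iterations", "500 more iterations") from the per-iteration cost. A sketch of that arithmetic, assuming a simple mean over completed iterations (the actual estimator may differ, e.g. a moving average):

    from datetime import timedelta

    def project_times(iteration_seconds, iterations_left, horizons=(10, 100, 500)):
        """Project remaining time from mean per-iteration duration (seconds)."""
        avg = sum(iteration_seconds) / len(iteration_seconds)
        remaining = timedelta(seconds=round(avg * iterations_left))
        per_horizon = {n: timedelta(seconds=round(avg * n)) for n in horizons}
        return remaining, per_horizon

    # ~55 s/iteration with ~210 iterations left gives roughly 3h 12m remaining
    # and about 9m 10s per 10 iterations, matching the magnitudes logged above.
    remaining, per_horizon = project_times([55.0] * 5, 210)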
[2025-08-20 20:21:36,446][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:21:36,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:21:36,454][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:21:38,912][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:21:38,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:21:38,919][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:21:38,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:21:38,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:21:39,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [identical record repeated 32× in total, ~0.8 s apart, last at 2025-08-20 20:22:04,318]
[2025-08-20 20:22:05,901][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:22:06,872][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:22:06,873][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:22:08,336][__main__][INFO] - Iteration 787 took 55s (38.09% Gen, 61.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 13m 17s. Estimated total time: 15h 24m 54s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 29s, 500 more iterations: 7h 42m 27s.
[2025-08-20 20:22:08,337][__main__][INFO] - Starting iteration 787.
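The paired "Sharing advantage alignment data." / "Receiving advantage packets." records indicate that each agent's trainer sends its advantage estimates to the other co-trained agent and receives that agent's estimates in return, so each side can form its advantage-alignment weighting. A transport-agnostic sketch; the send/recv callables are placeholders, since the actual exchange mechanism is not visible in the log:

    import torch

    def share_and_receive_advantages(own_advantages, send, recv):
        """Exchange advantage tensors between the two co-trained trainers.

        `send` and `recv` stand in for whatever transport the trainers use
        (queues, RPC, shared files); the log shows only that packets move."""
        send(own_advantages.detach().cpu())
        other_advantages = recv()
        return other_advantages.to(own_advantages.device)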
[2025-08-20 20:22:31,968][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:22:31,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:22:31,975][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:22:34,475][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:22:34,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:22:34,483][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:22:34,485][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:22:34,485][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:22:34,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [identical record repeated 32× in total, ~0.8 s apart, last at 2025-08-20 20:22:59,912]
[2025-08-20 20:23:01,527][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:23:02,461][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:23:02,463][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:23:03,739][__main__][INFO] - Iteration 788 took 55s (38.14% Gen, 61.86% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 10m 48s. Estimated total time: 15h 23m 20s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 20s, 500 more iterations: 7h 41m 40s.
[2025-08-20 20:23:03,740][__main__][INFO] - Starting iteration 788.
[2025-08-20 20:23:27,833][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:23:27,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:23:27,841][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:23:30,299][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:23:30,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:23:30,307][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:23:30,309][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:23:30,310][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:23:30,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [identical record repeated 32× in total, ~0.8 s apart, last at 2025-08-20 20:23:55,639]
[2025-08-20 20:23:57,226][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:23:58,159][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:23:58,161][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:23:59,481][__main__][INFO] - Iteration 789 took 55s (38.82% Gen, 61.18% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 15m 31s. Estimated total time: 15h 29m 0s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 54s, 500 more iterations: 7h 44m 30s.
[2025-08-20 20:23:59,495][__main__][INFO] - Starting iteration 789.
[2025-08-20 20:24:23,033][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:24:23,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:24:23,048][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:24:25,501][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:24:25,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:24:25,508][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:24:25,511][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:24:25,511][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:24:25,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [identical record repeated 32× in total, ~0.8 s apart, last at 2025-08-20 20:24:50,971]
[2025-08-20 20:24:52,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:24:53,492][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:24:53,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:24:54,844][__main__][INFO] - Iteration 790 took 55s (38.08% Gen, 61.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 7m 49s. Estimated total time: 15h 22m 13s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 13s, 500 more iterations: 7h 41m 6s.
[2025-08-20 20:24:54,845][__main__][INFO] - Starting iteration 790.
[2025-08-20 20:25:18,547][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:25:18,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:25:18,554][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:25:21,025][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:25:21,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:25:21,033][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:25:21,036][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:25:21,036][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:25:21,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [identical record repeated 32× in total, ~0.8 s apart, last at 2025-08-20 20:25:46,512]
[2025-08-20 20:25:48,136][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:25:49,031][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:25:49,032][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:25:50,244][__main__][INFO] - Iteration 791 took 55s (38.35% Gen, 61.64% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 7m 59s. Estimated total time: 15h 23m 18s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 39s.
[2025-08-20 20:25:50,245][__main__][INFO] - Starting iteration 791.
[2025-08-20 20:26:13,602][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:26:13,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:26:13,610][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:26:16,070][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:26:16,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:26:16,078][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:26:16,080][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:26:16,081][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:26:16,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
    [identical record repeated 32× in total, ~0.8 s apart, last at 2025-08-20 20:26:41,502]
[2025-08-20 20:26:43,092][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:26:44,026][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:26:44,028][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:26:45,367][__main__][INFO] - Iteration 792 took 55s (37.94% Gen, 62.06% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 2m 27s. Estimated total time: 15h 18m 41s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 20s.
[2025-08-20 20:26:45,369][__main__][INFO] - Starting iteration 792.
[2025-08-20 20:27:09,278][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:27:09,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:27:09,285][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:27:11,743][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:27:11,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:27:11,751][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:27:11,753][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:27:11,753][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:27:12,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:12,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:13,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:14,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:15,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:16,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:16,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:17,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:18,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:19,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:19,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:20,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:21,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:22,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:23,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:23,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:24,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:25,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:26,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:27,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:28,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:29,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:30,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:30,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:31,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:32,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:33,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:33,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:34,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:35,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:36,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:37,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:27:38,760][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:27:39,667][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:27:39,669][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:27:41,093][__main__][INFO] - Iteration 793 took 55s (38.49% Gen, 61.50% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 11m 34s. Estimated total time: 15h 28m 44s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 22s.
[2025-08-20 20:27:41,095][__main__][INFO] - Starting iteration 793.
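
Between "Receiving advantage packets" and "Apply reinforce step", every iteration in this section emits exactly 32 "Accumulated the policy gradient loss for 8960 tokens" records, so a single optimizer update aggregates roughly 32 × 8960 = 286,720 action tokens. A minimal sketch of that accumulate-then-step pattern, assuming a PyTorch policy; `token_log_probs`, `action_mask`, and `advantages` are placeholder names, not the real trainer_common API:

```python
import torch

def reinforce_update(policy, optimizer, micro_batches, logger):
    """Accumulate a token-level policy-gradient loss over micro-batches, then step once."""
    optimizer.zero_grad(set_to_none=True)
    for batch in micro_batches:                # e.g. 32 micro-batches of 8960 tokens each
        logp = policy.token_log_probs(batch)   # hypothetical: log-prob of each generated token
        mask = batch.action_mask               # 1 where a token belongs to the agent's actions
        # REINFORCE: weight each action token's log-prob by its (aligned) advantage
        loss = -(batch.advantages * logp * mask).sum() / mask.sum()
        loss.backward()                        # gradients accumulate across micro-batches
        logger.info("Accumulated the policy gradient loss for %d tokens.", int(mask.sum()))
    optimizer.step()                           # the single "Apply reinforce step" per iteration
```

Accumulating 32 backward passes before one step keeps the effective batch large while holding peak activation memory at the micro-batch level, consistent with the flat VRAM readings above.
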
[2025-08-20 20:28:04,856][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:28:04,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:28:04,864][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:28:07,329][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:28:07,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:28:07,337][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:28:07,339][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:28:07,340][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:28:07,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:08,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:09,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:10,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:10,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:11,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:12,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:13,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:13,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:14,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:15,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:16,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:17,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:17,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:18,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:19,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:20,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:21,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:22,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:23,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:24,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:24,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:25,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:26,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:27,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:27,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:28,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:29,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:30,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:31,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:31,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:32,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:28:34,389][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:28:35,325][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:28:35,326][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:28:36,563][__main__][INFO] - Iteration 794 took 55s (38.41% Gen, 61.59% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 6m 22s. Estimated total time: 15h 24m 28s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 26s, 500 more iterations: 7h 42m 14s.
[2025-08-20 20:28:36,565][__main__][INFO] - Starting iteration 794.
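
"Sharing advantage alignment data" followed immediately by "Receiving advantage packets" is the hand-off point between the two co-trained agents: each trainer exports its advantage estimates so the other can align its own policy gradient against them. The transport is not visible in the log; below is a deliberately thin sketch of the exchange, assuming a queue-based channel between the two trainer processes (the `AdvantagePacket` fields and the queue wiring are hypothetical):

```python
from dataclasses import dataclass, field

import torch

@dataclass
class AdvantagePacket:
    """Per-trajectory advantage estimates exchanged between the paired trainers."""
    trajectory_ids: list = field(default_factory=list)
    advantages: torch.Tensor = None  # one estimate per action step

def share_and_receive(own_packet, send_queue, recv_queue, logger):
    """Swap advantage packets with the co-trained agent (illustrative transport only)."""
    logger.info("Sharing advantage alignment data.")
    send_queue.put(own_packet)
    logger.info("Receiving advantage packets.")
    return recv_queue.get()  # blocks until the partner trainer has shared its packet
```

A blocking `get` would also be consistent with the timestamps: the two messages land within a millisecond of each other every iteration.
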
[2025-08-20 20:29:00,208][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:00,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:00,216][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:02,656][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:02,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:02,664][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:02,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:29:02,667][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:29:02,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:03,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:04,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:05,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:06,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:06,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:07,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:08,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:09,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:10,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:11,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:11,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:12,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:13,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:14,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:14,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:15,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:16,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:17,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:18,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:18,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:19,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:20,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:21,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:22,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:23,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:24,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:24,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:25,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:26,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:27,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:28,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:29,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:29:30,700][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:29:30,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:29:32,003][__main__][INFO] - Iteration 795 took 55s (38.24% Gen, 61.76% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 4m 56s. Estimated total time: 15h 23m 57s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 23s, 500 more iterations: 7h 41m 58s.
[2025-08-20 20:29:32,006][__main__][INFO] - Starting iteration 795.
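
After every update the trainer immediately rewrites both optimizer states under the run's `agent_trainer/` directory, so a preempted or crashed job loses at most the current iteration. A minimal sketch of that checkpointing step, assuming standard torch optimizers (the function name is illustrative; the filenames follow the log):

```python
import os

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
    """Persist both optimizer states so training can resume after preemption."""
    os.makedirs(out_dir, exist_ok=True)
    policy_path = os.path.join(out_dir, "policy_optimizer_state.pt")
    critic_path = os.path.join(out_dir, "critic_optimizer_state.pt")
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)
```

On restart, the states would be restored with `opt.load_state_dict(torch.load(path))` before the first step, preserving the optimizer's moment estimates across the interruption.
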
[2025-08-20 20:29:55,404][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:55,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:55,412][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:57,878][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:57,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:57,886][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:29:57,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:29:57,889][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:29:58,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:58,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:29:59,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:00,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:01,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:02,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:02,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:03,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:04,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:05,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:06,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:06,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:07,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:08,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:09,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:10,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:10,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:11,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:12,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:13,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:14,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:15,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:16,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:16,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:17,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:18,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:19,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:20,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:20,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:21,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:22,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:23,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:24,880][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:30:25,837][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:30:25,839][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:30:27,115][__main__][INFO] - Iteration 796 took 55s (37.99% Gen, 62.01% Train). Generation: 20s, Training: 34s. Estimated remaining time: 2h 58m 30s. Estimated total time: 15h 18m 26s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 50s, 500 more iterations: 7h 39m 13s.
[2025-08-20 20:30:27,117][__main__][INFO] - Starting iteration 796.
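
The closing summary of each iteration is a running extrapolation from recent iteration durations: at roughly 55 s/iteration, 10 more iterations come to about 9 m 10 s and 500 to about 7 h 39 m, matching the logged figures up to smoothing. Note that the Gen/Train percentages are taken against wall-clock time, so they need not sum to 100% when an iteration carries extra overhead (iteration 801 below: 36.12% + 59.39%). A sketch of that bookkeeping, assuming a fixed planned iteration count and a simple moving average (both assumptions):

```python
def _fmt(seconds: float) -> str:
    """Render a duration in the log's 'Xh Ym Zs' style."""
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    return f"{h}h {m}m {sec}s" if h else (f"{m}m {sec}s" if m else f"{sec}s")

def iteration_summary(i, wall_s, gen_s, train_s, recent_durations, total_iters):
    """Build the per-iteration timing line from measured generation/training times."""
    avg = sum(recent_durations) / len(recent_durations)  # smoothed seconds per iteration
    return (
        f"Iteration {i} took {int(round(wall_s))}s "
        f"({100 * gen_s / wall_s:.2f}% Gen, {100 * train_s / wall_s:.2f}% Train). "
        f"Generation: {int(gen_s)}s, Training: {int(train_s)}s. "
        f"Estimated remaining time: {_fmt(avg * (total_iters - i))}. "
        f"Estimated total time: {_fmt(avg * total_iters)}. "
        f"Time estimates for 10 more iterations: {_fmt(avg * 10)}, "
        f"100 more iterations: {_fmt(avg * 100)}, "
        f"500 more iterations: {_fmt(avg * 500)}."
    )
```
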
[2025-08-20 20:30:50,575][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:30:50,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:30:50,583][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:30:53,055][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:30:53,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:30:53,064][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:30:53,066][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:30:53,066][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:30:53,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:54,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:54,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:55,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:56,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:57,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:58,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:58,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:30:59,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:00,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:01,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:02,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:02,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:03,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:04,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:05,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:06,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:06,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:07,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:08,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:09,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:10,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:10,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:11,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:12,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:13,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:14,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:15,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:16,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:16,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:17,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:18,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:20,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:31:21,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:31:21,054][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:31:22,343][__main__][INFO] - Iteration 797 took 55s (38.01% Gen, 61.98% Train). Generation: 20s, Training: 34s. Estimated remaining time: 2h 59m 35s. Estimated total time: 15h 20m 26s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 2s, 500 more iterations: 7h 40m 13s.
[2025-08-20 20:31:22,345][__main__][INFO] - Starting iteration 797.
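
The block that opens every iteration ("Create alternative trajectory batch" → "Get advantages with critic gradient accumulation" → "Compute alternative advantage estimates", run twice per iteration, plausibly once per agent in the pair) is the counterfactual pass that feeds the advantage packets. The log shows only its timing, not its construction; below is a heavily simplified sketch of the shape of that computation, with every name (`with_actions`, `values`, `rewards_to_go`) a placeholder assumption:

```python
import torch

def alternative_advantage_estimates(critic, batch, alt_actions):
    """Re-score a trajectory batch under counterfactual actions and estimate advantages."""
    alt_batch = batch.with_actions(alt_actions)    # hypothetical: swap in alternative actions
    with torch.no_grad():
        values = critic.values(alt_batch.states)   # V(s) from the critic head
    returns = alt_batch.rewards_to_go              # discounted reward-to-go per step
    return returns - values                        # one-sample advantage estimate: A = G - V
```

In the real trainer the critic pass also accumulates critic gradients (per the "Get advantages with critic gradient accumulation" task name), so the `no_grad` here is a simplification.
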
[2025-08-20 20:31:45,873][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:31:45,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:31:45,880][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:31:48,323][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:31:48,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:31:48,331][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:31:48,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:31:48,334][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:31:48,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:49,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:50,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:51,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:51,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:52,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:53,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:54,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:54,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:55,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:56,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:57,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:58,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:58,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:31:59,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:00,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:01,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:02,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:03,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:04,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:05,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:05,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:06,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:07,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:08,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:08,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:09,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:10,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:11,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:12,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:12,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:13,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:15,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:32:16,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:32:16,301][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:32:17,619][__main__][INFO] - Iteration 798 took 55s (38.12% Gen, 61.88% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 59m 27s. Estimated total time: 15h 21m 13s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 7s, 500 more iterations: 7h 40m 36s.
[2025-08-20 20:32:17,620][__main__][INFO] - Starting iteration 798.
[2025-08-20 20:32:41,252][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:32:41,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:32:41,259][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:32:43,706][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:32:43,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:32:43,714][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:32:43,716][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:32:43,716][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:32:44,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:44,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:45,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:46,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:47,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:47,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:48,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:49,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:50,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:51,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:51,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:52,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:53,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:54,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:55,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:55,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:56,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:57,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:58,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:32:59,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:00,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:01,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:01,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:02,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:03,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:04,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:05,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:05,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:06,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:07,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:08,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:09,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:10,676][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:33:11,555][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:33:11,557][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:33:12,828][__main__][INFO] - Iteration 799 took 55s (38.38% Gen, 61.61% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 57m 25s. Estimated total time: 15h 20m 7s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 3s.
[2025-08-20 20:33:12,830][__main__][INFO] - Starting iteration 799.
[2025-08-20 20:33:36,361][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:33:36,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:33:36,368][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:33:38,797][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:33:38,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:33:38,804][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:33:38,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:33:38,807][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:33:39,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:39,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:40,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:41,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:42,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:43,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:43,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:45,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:45,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:46,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:47,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:48,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:49,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:49,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:50,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:51,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:52,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:53,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:53,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:54,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:55,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:56,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:56,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:57,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:58,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:33:59,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:00,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:01,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:02,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:03,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:03,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:04,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:06,264][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:34:07,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:34:07,228][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:34:08,563][__main__][INFO] - Iteration 800 took 55s (37.87% Gen, 62.13% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 5m 14s. Estimated total time: 15h 28m 52s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 53s, 500 more iterations: 7h 44m 26s.
[2025-08-20 20:34:08,564][__main__][INFO] - Starting iteration 800.
[2025-08-20 20:34:31,849][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:34:31,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:34:31,857][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:34:34,315][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:34:34,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:34:34,323][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:34:34,326][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:34:34,326][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:34:34,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:35,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:36,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:36,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:37,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:38,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:39,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:40,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:40,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:41,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:42,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:43,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:44,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:44,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:45,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:46,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:47,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:48,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:48,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:49,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:50,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:51,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:52,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:52,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:53,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:54,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:55,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:56,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:57,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:58,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:58,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:34:59,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:01,332][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:35:02,234][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:35:02,236][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:35:06,218][__main__][INFO] - Iteration 801 took 57s (36.12% Gen, 59.39% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3h 36m 18s. Estimated total time: 16h 0m 53s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 5s, 500 more iterations: 8h 0m 26s.
[2025-08-20 20:35:06,220][__main__][INFO] - Starting iteration 801.
[2025-08-20 20:35:30,032][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:35:30,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:35:30,040][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:35:32,496][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:35:32,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:35:32,503][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:35:32,505][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:35:32,506][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:35:32,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:33,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:34,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:35,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:35,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:36,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:37,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:38,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:39,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:39,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:40,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:41,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:42,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:43,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:43,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:44,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:45,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:46,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:47,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:47,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:48,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:49,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:50,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:51,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:51,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:52,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:53,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:54,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:55,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:56,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:57,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:57,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:35:59,489][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:36:00,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:36:00,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:36:01,710][__main__][INFO] - Iteration 802 took 55s (38.48% Gen, 61.52% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 59m 18s. Estimated total time: 15h 24m 49s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 28s, 500 more iterations: 7h 42m 24s.
[2025-08-20 20:36:01,711][__main__][INFO] - Starting iteration 802.
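Between "Receiving advantage packets" and "Apply reinforce step", each iteration accumulates the policy-gradient loss over exactly 32 microbatches of 8,960 tokens, so a single optimizer step covers about 287k tokens. A sketch of that accumulation pattern, assuming a standard REINFORCE surrogate; the tensor names, shapes, and function name are illustrative, not the trainer_common internals:

```python
import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def accumulate_and_step(policy, optimizer, microbatches):
    """One REINFORCE update accumulated over fixed-size token microbatches.

    Each microbatch supplies token ids, sampled actions, per-token advantages,
    and a response mask (hypothetical names). Each loss is scaled by the
    number of microbatches so the accumulated gradient is their average.
    """
    optimizer.zero_grad(set_to_none=True)
    for input_ids, actions, advantages, mask in microbatches:
        logits = policy(input_ids)                               # (B, T, V)
        logprobs = torch.log_softmax(logits, dim=-1)
        taken = logprobs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        loss = -(advantages * taken * mask).sum() / mask.sum()   # -E[A * log pi]
        (loss / len(microbatches)).backward()
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    int(mask.sum().item()))
    optimizer.step()                                             # "Apply reinforce step"
```

The ~0.8 s spacing of the "Accumulated …" lines is consistent with one forward/backward pass per microbatch, with the single `optimizer.step()` accounting for the ΔTime ≈ 26-27 s reported by the "Apply reinforce step" task.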
[2025-08-20 20:36:25,780][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:36:25,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:36:25,787][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:36:28,265][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:36:28,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:36:28,272][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:36:28,274][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:36:28,275][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:36:28,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:29,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:30,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:30,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:31,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:32,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:33,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:34,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:34,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:35,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:36,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:37,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:38,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:38,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:39,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:40,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:41,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:42,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:42,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:43,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:44,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:45,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:46,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:47,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:48,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:48,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:49,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:50,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:51,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:52,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:52,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:53,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:36:55,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:36:56,285][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:36:56,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:36:57,689][__main__][INFO] - Iteration 803 took 55s (38.56% Gen, 61.44% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 6m 31s. Estimated total time: 15h 32m 58s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 29s.
[2025-08-20 20:36:57,691][__main__][INFO] - Starting iteration 803.
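The "Get advantages with critic gradient accumulation" and "Compute alternative advantage estimates" tasks imply a learned value critic that scores both the sampled trajectory and a counterfactual ("alternative") one. The estimator itself is not shown in this log; below is a generic GAE-style sketch under that assumption, which could equally be reapplied to the alternative trajectory batch:

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over a length-T trajectory.

    rewards: (T,); values: (T + 1,) critic estimates, where values[T]
    bootstraps the tail. gamma/lam are illustrative defaults, not values
    taken from this run.
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                          # discounted sum
        adv[t] = gae
    return adv
```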
[2025-08-20 20:37:21,055][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:37:21,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:37:21,063][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:37:23,518][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:37:23,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:37:23,526][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:37:23,528][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:37:23,528][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:37:23,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:24,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:25,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:26,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:27,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:27,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:28,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:29,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:30,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:30,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:31,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:32,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:33,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:34,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:34,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:35,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:36,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:37,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:38,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:39,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:40,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:40,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:41,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:42,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:43,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:44,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:44,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:45,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:46,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:47,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:48,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:48,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:37:50,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:37:51,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:37:51,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:37:52,816][__main__][INFO] - Iteration 804 took 55s (37.93% Gen, 62.07% Train). Generation: 20s, Training: 34s. Estimated remaining time: 2h 51m 22s. Estimated total time: 15h 18m 44s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 52s, 500 more iterations: 7h 39m 22s.
[2025-08-20 20:37:52,817][__main__][INFO] - Starting iteration 804.
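"Sharing advantage alignment data" followed immediately by "Receiving advantage packets" points at a two-way exchange: each player's trainer publishes its own advantage estimates and receives its opponent's, which an opponent-shaping objective needs before the policy-gradient pass can start. The transport and packet schema are not visible in this log; the following is only a toy in-process stand-in with hypothetical field names:

```python
from dataclasses import dataclass

import torch

@dataclass
class AdvantagePacket:
    """Hypothetical payload one agent's trainer shares with the other's."""
    agent_id: str
    advantages: torch.Tensor       # per-step advantage estimates
    alt_advantages: torch.Tensor   # estimates for the alternative trajectories

def exchange(outgoing: dict[str, AdvantagePacket]) -> dict[str, AdvantagePacket]:
    """Two-player swap: each agent receives the packet the other one sent."""
    (a, packet_a), (b, packet_b) = outgoing.items()
    return {a: packet_b, b: packet_a}
```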
[2025-08-20 20:38:16,830][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:38:16,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:38:16,837][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:38:19,283][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:38:19,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:38:19,291][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:38:19,293][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:38:19,294][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:38:19,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:20,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:21,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:21,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:22,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:23,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:24,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:25,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:25,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:26,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:27,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:28,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:29,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:29,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:30,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:31,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:32,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:33,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:33,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:34,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:35,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:36,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:37,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:38,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:39,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:39,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:40,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:41,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:42,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:43,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:43,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:44,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:38:46,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:38:47,177][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:38:47,179][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:38:48,584][__main__][INFO] - Iteration 805 took 55s (38.69% Gen, 61.31% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 1m 8s. Estimated total time: 15h 29m 26s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 56s, 500 more iterations: 7h 44m 43s.
[2025-08-20 20:38:48,585][__main__][INFO] - Starting iteration 805.
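On the critic side of "Get advantages with critic gradient accumulation", a value-regression loss can be accumulated in the same pass that produces the advantages and stepped once per iteration, which would explain the separate critic optimizer state saved above. A sketch under that assumption (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def accumulate_critic_loss(critic, optimizer, batches):
    """Accumulate an MSE value loss over microbatches, then take one step.

    batches is a list of (states, returns) pairs; the loss is scaled by the
    batch count so the accumulated gradient is the average over microbatches.
    """
    optimizer.zero_grad(set_to_none=True)
    for states, returns in batches:
        values = critic(states).squeeze(-1)   # predicted state values
        loss = F.mse_loss(values, returns)    # regress toward empirical returns
        (loss / len(batches)).backward()
    optimizer.step()
```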
[2025-08-20 20:39:11,760][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:39:11,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:39:11,767][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:39:14,226][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:39:14,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:39:14,233][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:39:14,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:39:14,236][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:39:14,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:15,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:16,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:16,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:17,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:18,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:19,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:20,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:20,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:21,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:22,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:23,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:24,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:24,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:25,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:26,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:27,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:28,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:28,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:29,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:30,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:31,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:32,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:33,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:34,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:34,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:35,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:36,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:37,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:38,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:38,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:39,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:39:41,364][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:39:42,322][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:39:42,324][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:39:43,570][__main__][INFO] - Iteration 806 took 54s (37.73% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: 2h 47m 12s. Estimated total time: 15h 16m 24s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 12s.
[2025-08-20 20:39:43,572][__main__][INFO] - Starting iteration 806.
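After every REINFORCE step the run persists both optimizer states under seed_1000/agent_trainer/, so a preempted job can resume without losing the optimizers' accumulated moments. A minimal sketch of that checkpointing, assuming standard `torch.save` on `state_dict()`s; the helper name is hypothetical, and the paths mirror the ones in the log:

```python
import os

import torch

def save_optimizer_states(policy_opt: torch.optim.Optimizer,
                          critic_opt: torch.optim.Optimizer,
                          out_dir: str) -> None:
    """Write policy and critic optimizer states to <out_dir>/*.pt."""
    os.makedirs(out_dir, exist_ok=True)
    torch.save(policy_opt.state_dict(),
               os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(),
               os.path.join(out_dir, "critic_optimizer_state.pt"))
```

Resuming would load them back with `opt.load_state_dict(torch.load(path))`; the sub-second gap between the two "Saved …" lines suggests the adapter-sized states are small.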
[2025-08-20 20:40:07,267][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:40:07,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:40:07,275][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:40:09,725][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:40:09,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:40:09,733][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:40:09,735][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:40:09,735][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:40:10,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:10,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:11,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:12,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:13,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:14,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:14,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:15,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:16,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:17,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:17,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:18,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:19,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:20,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:21,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:21,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:22,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:23,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:24,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:25,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:25,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:26,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:27,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:28,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:29,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:30,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:31,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:31,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:32,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:33,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:34,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:35,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:40:36,941][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:40:37,904][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:40:37,905][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:40:39,247][__main__][INFO] - Iteration 807 took 55s (38.15% Gen, 61.85% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 57m 46s. Estimated total time: 15h 27m 54s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 47s, 500 more iterations: 7h 43m 57s.
[2025-08-20 20:40:39,248][__main__][INFO] - Starting iteration 807.
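The per-iteration summary splits wall time into generation and training shares and extrapolates the running average to 10/100/500 more iterations; the projections are linear in the average iteration time (e.g. 100 × ~55 s ≈ 1h 32m, matching the lines above). A sketch reproducing that arithmetic; the function and argument names are hypothetical, not the __main__ implementation:

```python
def hms(seconds: float) -> str:
    """Format seconds as 'Hh Mm Ss', like the log's time estimates."""
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"

def iteration_summary(it: int, gen_s: float, train_s: float,
                      avg_iter_s: float, iters_left: int) -> str:
    """Build the 'Iteration N took ...' line from measured phase times."""
    total = gen_s + train_s
    head = (f"Iteration {it} took {round(total)}s "
            f"({100 * gen_s / total:.2f}% Gen, {100 * train_s / total:.2f}% Train). "
            f"Generation: {int(gen_s)}s, Training: {int(train_s)}s. "
            f"Estimated remaining time: {hms(iters_left * avg_iter_s)}. "
            f"Estimated total time: {hms((it + iters_left) * avg_iter_s)}.")
    proj = ", ".join(f"{n} more iterations: {hms(n * avg_iter_s)}"
                     for n in (10, 100, 500))
    return f"{head} Time estimates for {proj}."
```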
[2025-08-20 20:41:02,882][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:41:02,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:41:02,889][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:41:05,347][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:41:05,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:41:05,355][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:41:05,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:41:05,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:41:05,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:06,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:07,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:08,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:08,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:09,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:10,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:11,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:12,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:12,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:13,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:14,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:15,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:15,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:16,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:17,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:18,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:19,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:19,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:21,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:22,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:22,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:23,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:24,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:25,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:25,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:26,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:27,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:28,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:29,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:29,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:30,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:41:32,382][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:41:33,345][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:41:33,347][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:41:34,634][__main__][INFO] - Iteration 808 took 55s (38.26% Gen, 61.74% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 52m 1s. Estimated total time: 15h 23m 5s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 18s, 500 more iterations: 7h 41m 32s.
[2025-08-20 20:41:34,635][__main__][INFO] - Starting iteration 808.
[2025-08-20 20:41:58,226][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:41:58,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:41:58,234][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:00,709][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:00,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:00,717][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:00,719][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:42:00,720][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:42:01,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:01,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:02,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:03,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:04,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:04,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:05,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:06,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:07,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:08,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:08,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:09,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:10,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:11,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:12,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:12,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:13,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:14,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:15,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:16,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:17,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:18,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:19,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:19,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:20,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:21,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:22,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:23,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:23,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:24,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:25,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:26,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:27,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:42:28,817][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:42:28,818][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:42:30,083][__main__][INFO] - Iteration 809 took 55s (38.13% Gen, 61.87% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 52m 8s. Estimated total time: 15h 24m 7s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 24s, 500 more iterations: 7h 42m 3s.
[2025-08-20 20:42:30,084][__main__][INFO] - Starting iteration 809.
[2025-08-20 20:42:53,445][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:53,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:53,453][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:55,905][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:55,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:55,914][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:42:55,915][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:42:55,916][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:42:56,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:57,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:57,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:58,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:42:59,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:00,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:00,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:01,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:02,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:03,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:04,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:04,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:05,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:06,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:07,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:08,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:08,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:10,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:10,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:11,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:12,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:13,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:14,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:14,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:15,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:16,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:17,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:18,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:18,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:19,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:20,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:21,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:43:22,879][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:43:23,777][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:43:23,778][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:43:25,088][__main__][INFO] - Iteration 810 took 55s (38.02% Gen, 61.97% Train). Generation: 20s, Training: 34s. Estimated remaining time: 2h 43m 49s. Estimated total time: 15h 16m 43s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 21s.
[2025-08-20 20:43:25,090][__main__][INFO] - Starting iteration 810.
[2025-08-20 20:43:49,047][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:43:49,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:43:49,055][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:43:51,506][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:43:51,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:43:51,514][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:43:51,516][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:43:51,517][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:43:51,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:52,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:53,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:54,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:54,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:55,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:56,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:57,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:58,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:58,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:43:59,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:00,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:01,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:02,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:02,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:03,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:44:04,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:05,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:06,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:07,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:08,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:09,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:09,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:10,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:11,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:12,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:13,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:13,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:14,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:15,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:16,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:17,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:18,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:44:19,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:44:19,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:44:20,945][__main__][INFO] - Iteration 811 took 55s (38.51% Gen, 61.49% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 57m 5s. Estimated total time: 15h 30m 55s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 27s. [2025-08-20 20:44:20,947][__main__][INFO] - Starting iteration 811. 
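
[annotation] The iteration-summary records can be cross-checked with simple arithmetic. Assuming the estimator is a plain per-iteration average (the exact smoothing used by __main__ is not visible here), the "10/100/500 more iterations" projections at iteration 811 are all consistent with roughly 55.8 s per iteration:

    def fmt(seconds: float) -> str:
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s"

    avg_iter_s = (9 * 60 + 18) / 10  # "10 more iterations: 9m 18s" -> 55.8 s/iter
    for k in (10, 100, 500):
        print(f"{k} more iterations: {fmt(k * avg_iter_s)}")
    # 10 -> 9m 18s, 100 -> 1h 33m 0s, 500 -> 7h 45m 0s (log: 1h 33m 5s, 7h 45m 27s)
    # Remaining 2h 57m 5s / 55.8 s per iter ~ 190 iterations left, which together
    # with iteration 811 done suggests roughly a 1000-iteration run (an inference,
    # not stated in the log).

The logged projections agree with these figures to within rounding of the per-iteration average, so the three horizons appear to share a single averaged iteration time.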
[2025-08-20 20:44:44,287][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:44:44,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:44:44,295][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:44:46,763][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:44:46,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:44:46,771][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:44:46,773][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:44:46,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:44:47,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:47,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:48,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:49,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:50,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:51,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:51,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:52,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:53,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:54,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:55,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:55,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:56,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:57,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:58,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:44:58,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:44:59,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:00,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:01,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:02,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:02,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:03,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:04,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:05,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:06,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:07,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:08,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:09,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:09,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:10,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:11,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:12,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:13,897][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:45:14,836][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:45:14,837][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:45:16,090][__main__][INFO] - Iteration 812 took 55s (37.84% Gen, 62.16% Train). Generation: 20s, Training: 34s. Estimated remaining time: 2h 44m 18s. Estimated total time: 15h 19m 3s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 54s, 500 more iterations: 7h 39m 31s. [2025-08-20 20:45:16,092][__main__][INFO] - Starting iteration 812. 
[2025-08-20 20:45:39,933][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:45:39,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:45:39,941][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:45:42,389][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:45:42,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:45:42,397][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:45:42,399][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:45:42,400][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:45:42,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:43,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:44,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:45,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:45,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:46,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:47,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:48,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:49,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:49,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:50,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:51,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:52,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:53,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:53,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:54,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:45:55,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:56,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:56,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:57,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:58,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:45:59,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:00,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:00,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:02,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:02,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:03,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:04,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:05,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:06,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:06,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:07,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:09,366][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:46:10,306][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:46:10,308][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:46:11,625][__main__][INFO] - Iteration 813 took 55s (38.53% Gen, 61.47% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 49m 52s. Estimated total time: 15h 25m 33s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 33s, 500 more iterations: 7h 42m 46s. [2025-08-20 20:46:11,627][__main__][INFO] - Starting iteration 813. 
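
[annotation] The "For task: ..." records report VRAM and wall-clock deltas around each phase. The sketch below shows instrumentation that would produce records of this shape, assuming torch.cuda memory counters on device 0 and that "Percentage of VRAM taken" means reserved memory over device capacity; both are assumptions, not confirmed by the log.

    import time
    from contextlib import contextmanager
    import torch

    @contextmanager
    def track_task(name, logger):
        """Log VRAM and time deltas for a named phase, in the log's format."""
        alloc0 = torch.cuda.memory_allocated()
        res0 = torch.cuda.memory_reserved()
        t0 = time.monotonic()
        try:
            yield
        finally:
            d_alloc = (torch.cuda.memory_allocated() - alloc0) / 2**30  # GiB
            d_res = (torch.cuda.memory_reserved() - res0) / 2**30      # GiB
            dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - t0))
            total = torch.cuda.get_device_properties(0).total_memory
            pct = 100.0 * torch.cuda.memory_reserved() / total
            logger.info(
                "For task: %s , ΔVRAM Allocated: %s GB, ΔVRAM Reserved: %s GB, "
                "ΔTime: %s, Percentage of VRAM taken: %s%%,",
                name, d_alloc, d_res, dt, pct)

The 0.0 GB deltas logged for the advantage phases are consistent with computations whose temporary buffers are freed or reused before the context exits, and the tiny 2.288818359375e-05 GB allocated delta on the reinforce step is about 24 KiB, i.e. bookkeeping rather than new activations.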
[2025-08-20 20:46:35,509][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:46:35,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:46:35,517][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:46:38,003][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:46:38,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:46:38,010][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:46:38,013][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:46:38,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:46:38,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:39,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:39,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:40,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:41,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:42,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:43,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:43,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:44,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:45,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:46,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:47,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:47,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:48,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:49,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:50,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:46:51,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:51,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:52,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:53,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:54,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:55,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:55,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:56,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:57,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:58,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:58,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:46:59,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:01,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:01,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:02,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:03,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:05,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:47:05,988][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:47:05,989][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:47:07,578][__main__][INFO] - Iteration 814 took 55s (38.26% Gen, 61.74% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 55m 54s. Estimated total time: 15h 32m 31s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 15s. [2025-08-20 20:47:07,580][__main__][INFO] - Starting iteration 814. 
[2025-08-20 20:47:31,563][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:47:31,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:47:31,570][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:47:34,027][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:47:34,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:47:34,034][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:47:34,037][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:47:34,037][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:47:34,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:35,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:35,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:36,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:37,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:38,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:39,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:39,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:40,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:41,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:42,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:43,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:43,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:44,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:45,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:46,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:47:47,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:47,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:48,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:49,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:50,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:51,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:51,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:53,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:53,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:54,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:55,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:56,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:57,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:57,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:58,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:47:59,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:01,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:48:02,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:48:02,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:48:03,274][__main__][INFO] - Iteration 815 took 55s (38.67% Gen, 61.33% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 50m 41s. Estimated total time: 15h 28m 13s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 49s, 500 more iterations: 7h 44m 6s. [2025-08-20 20:48:03,275][__main__][INFO] - Starting iteration 815. 
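
[annotation] Each iteration ends by persisting both optimizers' states to the run directory under agent_trainer/. A minimal sketch using torch.save on each optimizer's state_dict(); the function name and logger are illustrative, and the paths follow the ones in the records above.

    import os
    import torch

    def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
        """Checkpoint both optimizers, as in the "Saved ... optimizer state" records."""
        policy_path = os.path.join(out_dir, "policy_optimizer_state.pt")
        critic_path = os.path.join(out_dir, "critic_optimizer_state.pt")
        torch.save(policy_opt.state_dict(), policy_path)
        logger.info("Saved main optimizer state to %s", policy_path)
        torch.save(critic_opt.state_dict(), critic_path)
        logger.info("Saved critic optimizer state to %s", critic_path)

Saving optimizer state rather than only model weights is what lets momentum and Adam moment estimates survive a restart, which matters for a run estimated at over 15 hours.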
[2025-08-20 20:48:27,476][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:48:27,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:48:27,484][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:48:29,930][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:48:29,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:48:29,938][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:48:29,940][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:48:29,940][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:48:30,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:31,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:31,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:32,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:33,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:34,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:34,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:35,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:36,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:37,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:38,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:38,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:39,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:40,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:41,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:42,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:48:42,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:43,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:44,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:45,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:46,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:47,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:48,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:48,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:49,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:50,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:51,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:52,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:52,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:53,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:54,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:55,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:48:56,931][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:48:57,885][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:48:57,886][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:48:59,260][__main__][INFO] - Iteration 816 took 55s (38.85% Gen, 61.15% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 54m 36s. Estimated total time: 15h 33m 4s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 18s, 500 more iterations: 7h 46m 32s. [2025-08-20 20:48:59,262][__main__][INFO] - Starting iteration 816. 
[2025-08-20 20:49:22,401][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:49:22,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:49:22,408][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:49:24,854][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:49:24,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:49:24,862][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:49:24,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:49:24,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:49:25,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:25,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:26,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:27,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:28,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:29,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:29,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:30,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:31,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:32,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:33,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:33,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:34,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:35,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:36,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:37,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:49:37,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:38,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:39,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:40,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:41,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:42,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:43,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:43,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:44,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:45,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:46,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:47,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:47,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:48,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:49,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:50,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:49:51,974][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:49:52,885][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:49:52,887][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:49:54,217][__main__][INFO] - Iteration 817 took 54s (37.67% Gen, 62.33% Train). Generation: 20s, Training: 34s. Estimated remaining time: 2h 36m 31s. Estimated total time: 15h 15m 54s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 57s. [2025-08-20 20:49:54,218][__main__][INFO] - Starting iteration 817. 
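
[annotation] "Sharing advantage alignment data." followed by "Receiving advantage packets." indicates the two trainers exchange advantage estimates every iteration before accumulating the policy gradient. The sketch below is speculative: it assumes an in-process queue transport and that the received advantages weight the policy-gradient surrogate by the product of both players' advantages, in the spirit of advantage alignment; none of these names or choices appear in the log.

    import queue
    import torch

    def exchange_and_weight(my_adv, logp, outbox: queue.Queue, inbox: queue.Queue):
        # "Sharing advantage alignment data."
        outbox.put(my_adv.detach().cpu())
        # "Receiving advantage packets."
        other_adv = inbox.get().to(my_adv.device)
        # Speculative surrogate: weight each token's log-prob by the product of
        # both players' advantages, so the gradient favors mutually advantageous
        # actions rather than purely selfish ones.
        return -(my_adv * other_adv * logp).mean()

The sub-millisecond gap between the two records (20:49:24,864 for both) fits a local, already-buffered exchange rather than a blocking network transfer, though the log does not identify the transport.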
[2025-08-20 20:50:17,928][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:50:17,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:50:17,936][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:50:20,400][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:50:20,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:50:20,408][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:50:20,410][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:50:20,411][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:50:20,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:21,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:22,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:23,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:23,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:24,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:25,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:26,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:27,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:27,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:28,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:29,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:30,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:31,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:31,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:32,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:50:33,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:34,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:34,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:35,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:36,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:37,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:38,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:39,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:40,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:41,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:41,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:42,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:43,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:44,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:45,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:45,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:50:47,492][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:50:48,458][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:50:48,459][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:50:49,852][__main__][INFO] - Iteration 818 took 55s (38.22% Gen, 61.78% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 46m 54s. Estimated total time: 15h 27m 13s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 43s, 500 more iterations: 7h 43m 36s. [2025-08-20 20:50:49,853][__main__][INFO] - Starting iteration 818. 
[2025-08-20 20:51:13,704][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:51:13,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:51:13,711][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:51:16,159][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:51:16,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:51:16,167][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 20:51:16,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 20:51:16,170][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 20:51:16,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:17,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:18,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:18,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:19,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:20,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:21,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:22,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:22,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:23,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:24,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:25,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:26,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:26,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:27,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:28,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 20:51:29,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:29,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:30,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:32,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:32,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:33,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:34,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:35,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:35,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:36,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:37,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:38,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:39,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:39,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:40,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:41,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 20:51:43,191][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 20:51:44,127][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 20:51:44,128][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 20:51:45,461][__main__][INFO] - Iteration 819 took 55s (38.50% Gen, 61.50% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 45m 32s. Estimated total time: 15h 26m 47s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 40s, 500 more iterations: 7h 43m 23s. [2025-08-20 20:51:45,462][__main__][INFO] - Starting iteration 819. 
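
[annotation] "Create alternative trajectory batch" followed by "Compute alternative advantage estimates" (run twice per iteration) suggests the critic is also evaluated on counterfactual trajectories, yielding advantages for actions other than the ones actually taken. A speculative one-step sketch under those assumptions; `critic`, the batch fields, and the one-step return form are all guesses.

    import torch

    @torch.no_grad()
    def alternative_advantages(critic, states, next_states, alt_rewards, gamma=0.99):
        """Advantage of a counterfactual action a' at each state:
        A(s, a') = r(s, a') + gamma * V(s') - V(s)."""
        values = critic(states)            # V(s_t) under the current critic
        alt_q = alt_rewards + gamma * critic(next_states)  # one-step return for a'
        return alt_q - values

The 0.0 GB VRAM deltas and 00:00:00 ΔTime logged for these tasks fit a no_grad evaluation over a small batch that releases its temporaries immediately.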
[2025-08-20 20:52:09,003][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:52:09,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:52:09,010][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:52:11,488][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:52:11,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:52:11,496][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:52:11,498][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:52:11,498][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:52:11,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:12,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:13,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:14,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:14,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:15,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:16,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:17,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:18,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:18,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:19,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:20,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:21,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:22,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:22,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:23,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:24,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:25,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:26,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:26,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:27,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:29,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:29,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:30,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:31,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:32,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:32,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:33,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:34,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:35,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:36,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:36,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:52:38,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:52:39,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:52:39,610][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:52:41,441][__main__][INFO] - Iteration 820 took 55s (37.64% Gen, 62.36% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 50m 48s. Estimated total time: 15h 32m 58s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 17s, 500 more iterations: 7h 46m 29s.
[2025-08-20 20:52:41,443][__main__][INFO] - Starting iteration 820.
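Each iteration logs 32 "Accumulated the policy gradient loss for 8960 tokens." entries followed by one "Apply reinforce step": the gradient is accumulated over 32 micro-batches (roughly 287k action tokens in total) before a single optimizer step. A sketch of that loop, assuming a REINFORCE-style objective with per-token advantages; the batch schema and method names here are hypothetical, not the actual mllm API.

import torch

def reinforce_step(policy, optimizer, microbatches, logger):
    """Accumulate a token-normalized policy-gradient loss over micro-batches,
    then apply one optimizer step (sketch; field names are illustrative)."""
    optimizer.zero_grad(set_to_none=True)
    for mb in microbatches:  # e.g. a list of 32 micro-batches of 8960 tokens each
        logp = policy.log_probs(mb["input_ids"], mb["action_mask"])
        n_tokens = mb["action_mask"].sum()
        # REINFORCE loss: -E[advantage * log pi(a|s)], averaged per token.
        loss = -(mb["advantages"] * logp).sum() / n_tokens
        # Scale so the accumulated gradient averages over micro-batches.
        (loss / len(microbatches)).backward()
        logger.info(f"Accumulated the policy gradient loss for {int(n_tokens)} tokens.")
    optimizer.step()

Accumulation like this keeps peak activation memory at the size of one 8960-token micro-batch while the effective batch for the update is 32 times larger, which matches the flat ΔVRAM readings above.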
[2025-08-20 20:53:05,176][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:53:05,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:53:05,184][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:53:07,618][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:53:07,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:53:07,626][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:53:07,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:53:07,628][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:53:07,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:08,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:09,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:10,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:11,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:11,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:12,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:13,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:14,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:15,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:15,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:16,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:17,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:18,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:19,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:19,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:20,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:21,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:22,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:23,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:23,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:24,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:25,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:26,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:27,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:28,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:29,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:29,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:30,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:31,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:32,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:33,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:53:34,711][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:53:35,674][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:53:35,675][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:53:37,043][__main__][INFO] - Iteration 821 took 55s (38.31% Gen, 61.69% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 43m 33s. Estimated total time: 15h 26m 39s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 19s.
[2025-08-20 20:53:37,044][__main__][INFO] - Starting iteration 821.
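After every reinforce step the run checkpoints both optimizer states into the agent_trainer directory, which is what lets a preempted job resume without losing the optimizer's accumulated moments. A plausible sketch of that step using torch.save; the helper name and argument layout are hypothetical.

import os

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
    """Persist optimizer state_dicts so a resumed run continues mid-training (sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    policy_path = os.path.join(out_dir, "policy_optimizer_state.pt")
    critic_path = os.path.join(out_dir, "critic_optimizer_state.pt")
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info(f"Saved main optimizer state to {policy_path}")
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info(f"Saved critic optimizer state to {critic_path}")

The adapter weights themselves would be checkpointed separately; only the optimizer states appear in these particular log lines, and the saves complete in well under two seconds per iteration.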
[2025-08-20 20:54:01,149][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:01,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:01,156][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:03,631][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:03,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:03,638][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:03,641][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:54:03,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:54:03,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:04,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:05,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:06,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:07,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:07,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:08,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:09,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:10,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:11,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:11,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:12,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:13,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:14,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:15,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:15,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:16,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:17,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:18,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:19,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:19,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:20,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:21,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:22,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:23,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:24,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:25,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:25,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:26,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:27,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:28,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:29,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:54:30,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:54:31,589][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:54:31,590][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:54:32,991][__main__][INFO] - Iteration 822 took 55s (38.67% Gen, 61.33% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 48m 24s. Estimated total time: 15h 32m 26s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 14s, 500 more iterations: 7h 46m 13s.
[2025-08-20 20:54:32,992][__main__][INFO] - Starting iteration 822.
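The per-iteration summary line is plain arithmetic over the measured generation and training times: at roughly 55.6s per iteration, 10 more iterations comes out near the logged 9m 16s and 500 near 7h 43m. A sketch that reproduces the shape of that report; the function name and exact duration formatting are illustrative.

import datetime

def eta_report(it, total_iters, gen_s, train_s):
    """Build a timing summary like the per-iteration log line (sketch)."""
    iter_s = gen_s + train_s
    remaining = datetime.timedelta(seconds=round((total_iters - it) * iter_s))
    total = datetime.timedelta(seconds=round(total_iters * iter_s))
    horizons = ", ".join(
        f"{n} more iterations: {datetime.timedelta(seconds=round(n * iter_s))}"
        for n in (10, 100, 500)
    )
    return (
        f"Iteration {it} took {iter_s:.0f}s ({100 * gen_s / iter_s:.2f}% Gen, "
        f"{100 * train_s / iter_s:.2f}% Train). Generation: {gen_s:.0f}s, "
        f"Training: {train_s:.0f}s. Estimated remaining time: {remaining}. "
        f"Estimated total time: {total}. Time estimates for {horizons}."
    )

The slight jitter in the logged estimates (e.g. 9m 15s vs 9m 19s for 10 iterations) suggests the projection uses the fractional duration of the most recent iteration rather than a long-run average.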
[2025-08-20 20:54:56,872][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:56,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:56,880][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:59,318][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:59,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:59,326][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:54:59,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:54:59,329][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:54:59,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:00,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:01,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:02,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:02,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:03,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:04,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:05,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:05,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:06,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:07,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:08,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:09,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:09,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:10,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:11,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:12,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:13,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:13,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:14,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:15,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:16,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:17,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:17,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:19,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:20,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:20,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:21,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:22,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:23,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:24,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:24,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:26,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:55:27,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:55:27,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:55:28,709][__main__][INFO] - Iteration 823 took 55s (38.49% Gen, 61.51% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 43m 38s. Estimated total time: 15h 28m 36s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 51s, 500 more iterations: 7h 44m 18s.
[2025-08-20 20:55:28,711][__main__][INFO] - Starting iteration 823.
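The paired "Sharing advantage alignment data." / "Receiving advantage packets." messages suggest the two agent trainers exchange their per-step advantage estimates so that each side can form the advantage-alignment objective against its opponent's advantages. The transport is not visible in the log, so this is only a guess at the mechanism; a minimal single-process sketch with queues, all names hypothetical:

import queue

def exchange_advantages(my_advantages, outbox: queue.Queue, inbox: queue.Queue, logger):
    """Hypothetical sketch: publish this agent's advantage estimates and
    block until the opponent's packet arrives."""
    logger.info("Sharing advantage alignment data.")
    outbox.put(my_advantages.detach().cpu())
    logger.info("Receiving advantage packets.")
    return inbox.get()

The sub-millisecond gap between the two messages in the log is consistent with the opponent's packet already being available when the receive is issued.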
[2025-08-20 20:55:52,579][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:55:52,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:55:52,587][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:55:55,043][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:55:55,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:55:55,051][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:55:55,053][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:55:55,053][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:55:55,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:56,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:56,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:57,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:58,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:55:59,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:00,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:00,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:01,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:02,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:03,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:04,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:04,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:05,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:06,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:07,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:08,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:08,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:09,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:10,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:11,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:12,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:12,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:13,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:14,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:15,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:16,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:16,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:17,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:18,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:19,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:20,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:21,760][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:56:22,727][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:56:22,729][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:56:24,269][__main__][INFO] - Iteration 824 took 55s (38.57% Gen, 61.43% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 40m 4s. Estimated total time: 15h 25m 57s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 35s, 500 more iterations: 7h 42m 58s.
[2025-08-20 20:56:24,270][__main__][INFO] - Starting iteration 824.
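The recurring "Create alternative trajectory batch" followed by "Compute alternative advantage estimates" reads as the critic re-scoring a batch of alternative (counterfactual) trajectories to produce advantage estimates for them. Assuming a one-step TD formulation with a value-head critic; the function and batch field names below are illustrative, not the actual mllm code.

import torch

@torch.no_grad()
def alternative_advantages(critic, batch, gamma: float = 0.99):
    """One-step TD advantage estimates for an alternative trajectory batch (sketch):
    A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    values = critic(batch["states"])            # V(s_t)
    next_values = critic(batch["next_states"])  # V(s_{t+1})
    not_done = (~batch["dones"]).float()        # zero out bootstrap at episode ends
    td_target = batch["rewards"] + gamma * next_values * not_done
    return td_target - values

Running this under no_grad would also explain the 0.0 GB ΔVRAM readings for these tasks: no activation graph is retained for the critic's forward passes.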
[2025-08-20 20:56:47,977][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:56:47,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:56:47,985][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:56:50,441][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:56:50,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:56:50,449][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:56:50,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:56:50,452][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:56:50,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:51,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:52,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:53,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:53,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:54,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:55,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:56,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:57,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:57,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:58,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:56:59,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:00,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:01,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:01,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:02,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:03,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:04,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:05,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:05,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:06,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:07,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:08,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:09,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:09,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:10,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:11,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:12,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:13,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:13,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:15,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:15,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:17,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:57:18,345][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:57:18,347][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:57:19,686][__main__][INFO] - Iteration 825 took 55s (38.40% Gen, 61.60% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 36m 47s. Estimated total time: 15h 23m 35s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 47s.
[2025-08-20 20:57:19,688][__main__][INFO] - Starting iteration 825.
[2025-08-20 20:57:43,654][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:57:43,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:57:43,662][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:57:46,114][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:57:46,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:57:46,121][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:57:46,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:57:46,124][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:57:46,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:47,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:48,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:48,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:49,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:50,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:51,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:51,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:52,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:53,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:54,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:55,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:55,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:56,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:57,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:58,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:59,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:57:59,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:00,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:01,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:02,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:03,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:04,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:05,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:06,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:06,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:07,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:08,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:09,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:09,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:10,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:11,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:13,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:58:14,190][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:58:14,192][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:58:15,490][__main__][INFO] - Iteration 826 took 55s (38.57% Gen, 61.43% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 42m 17s. Estimated total time: 15h 30m 2s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 1s.
[2025-08-20 20:58:15,492][__main__][INFO] - Starting iteration 826.
[2025-08-20 20:58:39,687][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:58:39,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:58:39,695][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:58:42,137][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:58:42,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:58:42,144][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:58:42,147][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:58:42,147][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:58:42,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:43,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:44,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:44,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:45,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:46,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:47,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:47,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:48,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:49,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:50,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:51,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:51,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:52,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:53,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:54,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:55,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:55,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:56,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:57,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:58,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:59,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:58:59,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:01,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:02,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:02,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:03,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:04,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:05,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:05,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:06,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:07,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:09,175][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 20:59:10,128][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 20:59:10,129][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 20:59:11,567][__main__][INFO] - Iteration 827 took 56s (38.84% Gen, 61.16% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 45m 54s. Estimated total time: 15h 34m 34s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 17s.
[2025-08-20 20:59:11,568][__main__][INFO] - Starting iteration 827.
[2025-08-20 20:59:35,626][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:59:35,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:59:35,634][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:59:38,091][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:59:38,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:59:38,099][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 20:59:38,101][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 20:59:38,102][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 20:59:38,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:39,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:39,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:40,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:41,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:42,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:43,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:43,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:44,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:45,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:46,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:47,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:47,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:48,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:49,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:50,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:51,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:51,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:52,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:53,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:54,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:55,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:55,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:56,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:57,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:58,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 20:59:59,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:00:00,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:00:01,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:00:01,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:00:02,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:00:03,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:00:05,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:00:06,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:00:06,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:00:09,050][__main__][INFO] - Iteration 828 took 57s (37.59% Gen, 62.41% Train). Generation: 21s, Training: 35s. Estimated remaining time: 3h 8m 22s. Estimated total time: 15h 58m 0s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 48s, 500 more iterations: 7h 59m 0s.
[2025-08-20 21:00:09,051][__main__][INFO] - Starting iteration 828.
[2025-08-20 21:00:33,156][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:00:33,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:00:33,163][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:00:35,627][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:00:35,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:00:35,635][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:00:35,637][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:00:35,637][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:00:35,937 … 21:01:01,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 micro-batches, 286,720 tokens)
[2025-08-20 21:01:02,625][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:01:03,571][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:01:03,572][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:01:05,728][__main__][INFO] - Iteration 829 took 56s (38.19% Gen, 61.81% Train). Generation: 21s, Training: 35s. Estimated remaining time: 2h 54m 1s. Estimated total time: 15h 44m 36s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 27s, 500 more iterations: 7h 52m 18s.
[2025-08-20 21:01:05,730][__main__][INFO] - Starting iteration 829.
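Editor's note: the "For task: … ΔVRAM Allocated … ΔTime … Percentage of VRAM taken" records are produced by instrumentation wrapped around each task. A sketch of how such a probe can be built from PyTorch's CUDA memory counters; the class name and exact log format are assumptions chosen to match the fields in the log:

```python
import time
import logging
import torch

log = logging.getLogger("mllm.training.trainer_ad_align")

class TaskProbe:
    """Context manager logging VRAM and wall-clock deltas for a named task."""

    def __init__(self, task: str):
        self.task = task

    def __enter__(self):
        self.alloc0 = torch.cuda.memory_allocated()
        self.res0 = torch.cuda.memory_reserved()
        self.t0 = time.monotonic()
        return self

    def __exit__(self, *exc):
        gib = 1024 ** 3
        d_alloc = (torch.cuda.memory_allocated() - self.alloc0) / gib
        d_res = (torch.cuda.memory_reserved() - self.res0) / gib
        dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - self.t0))
        free, total = torch.cuda.mem_get_info()  # device-wide bytes
        pct = 100.0 * (total - free) / total     # "Percentage of VRAM taken"
        log.info(f"For task: {self.task} , ΔVRAM Allocated: {d_alloc} GB, "
                 f"ΔVRAM Reserved: {d_res} GB, ΔTime: {dt}, "
                 f"Percentage of VRAM taken: {pct}%,")
```

Usage would look like `with TaskProbe("Create alternative trajectory batch"): ...`; a 0.0 GB delta, as seen throughout, means the task reused already-allocated buffers.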
[2025-08-20 21:01:29,933][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:01:29,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:01:29,941][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:01:32,400][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:01:32,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:01:32,407][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:01:32,410][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:01:32,410][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:01:32,709 … 21:01:57,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 micro-batches, 286,720 tokens)
[2025-08-20 21:01:59,505][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:02:00,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:02:00,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:02:02,059][__main__][INFO] - Iteration 830 took 56s (38.59% Gen, 61.41% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 47m 18s. Estimated total time: 15h 38m 49s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 52s, 500 more iterations: 7h 49m 24s.
[2025-08-20 21:02:02,061][__main__][INFO] - Starting iteration 830.
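Editor's note: the paired "Saved main/critic optimizer state" records show both optimizers are checkpointed after every reinforce step, so a preempted run can resume without losing optimizer momentum. A plausible sketch using `torch.save`; the filenames mirror the log, but the helper function itself is an assumption:

```python
import os
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir):
    """Checkpoint both optimizer states after each reinforce step (sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    policy_path = os.path.join(out_dir, "policy_optimizer_state.pt")
    critic_path = os.path.join(out_dir, "critic_optimizer_state.pt")
    torch.save(policy_opt.state_dict(), policy_path)
    torch.save(critic_opt.state_dict(), critic_path)
    return policy_path, critic_path
```

Saving every iteration costs about a second per step here; resuming would load the files back with `load_state_dict`.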
[2025-08-20 21:02:26,060][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:02:26,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:02:26,067][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:02:28,524][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:02:28,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:02:28,531][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:02:28,534][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:02:28,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:02:28,833 … 21:02:53,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 micro-batches, 286,720 tokens)
[2025-08-20 21:02:55,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:02:56,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:02:56,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:02:57,804][__main__][INFO] - Iteration 831 took 55s (38.67% Gen, 61.33% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 36m 36s. Estimated total time: 15h 29m 2s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 54s, 500 more iterations: 7h 44m 31s.
[2025-08-20 21:02:57,805][__main__][INFO] - Starting iteration 831.
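Editor's note: the iteration summaries extrapolate timing linearly. For iteration 831, "10 more iterations: 9m 17s" implies ≈55.7 s per iteration; "Estimated remaining time: 2h 36m 36s" is 9,396 s ≈ 169 × 55.6 s, consistent with a planned run of ~1,000 iterations (169 left after 831); "Estimated total time" is elapsed plus remaining. A hedged sketch of that estimator; the smoothing window and the 1,000-iteration total are assumptions inferred from the numbers, not read from the code:

```python
from datetime import timedelta

def eta_report(iter_times, total_iters=1000, window=50):
    """Extrapolate remaining/total time from recent iteration durations (s)."""
    done = len(iter_times)
    avg = sum(iter_times[-window:]) / min(done, window)  # smoothed s/iter
    remaining = avg * (total_iters - done)
    elapsed = sum(iter_times)
    report = {
        "remaining": timedelta(seconds=round(remaining)),
        "total": timedelta(seconds=round(elapsed + remaining)),
    }
    for n in (10, 100, 500):  # the "N more iterations" projections
        report[f"{n} more iterations"] = timedelta(seconds=round(avg * n))
    return report
```

The smoothing explains why the projections drift between iterations even though each one takes 55–57 s.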
[2025-08-20 21:03:22,022][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:03:22,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:03:22,029][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:03:24,480][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:03:24,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:03:24,487][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:03:24,490][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:03:24,490][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:03:24,791 … 21:03:49,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 micro-batches, 286,720 tokens)
[2025-08-20 21:03:51,578][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:03:52,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:03:52,566][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:03:53,816][__main__][INFO] - Iteration 832 took 56s (38.88% Gen, 61.12% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 40m 7s. Estimated total time: 15h 33m 30s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 21s, 500 more iterations: 7h 46m 45s.
[2025-08-20 21:03:53,818][__main__][INFO] - Starting iteration 832.
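Editor's note: "Get advantages with critic gradient accumulation" suggests the advantage estimates and the critic's own regression gradient are produced in one pass. A schematic of that pattern with a TD(0) estimator; the real trainer's estimator (e.g. GAE), discount, and interfaces are not visible in the log, so every name below is illustrative:

```python
import torch.nn.functional as F

def advantages_with_critic_grad(values, rewards, gamma=0.96):
    """Sketch: compute advantages while accumulating the critic gradient.

    values:  critic estimates V(s_t), shape [T + 1], requires_grad=True
             (the last entry bootstraps the final step)
    rewards: per-step rewards, shape [T]
    """
    targets = rewards + gamma * values[1:]
    advantages = (targets - values[:-1]).detach()   # for the policy update
    critic_loss = F.mse_loss(values[:-1], targets.detach())
    critic_loss.backward()  # "critic gradient accumulation" across batches
    return advantages
```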
[2025-08-20 21:04:18,379][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:04:18,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:04:18,387][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:04:20,841][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:04:20,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:04:20,848][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:04:20,850][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:04:20,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:04:21,150 … 21:04:46,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 micro-batches, 286,720 tokens)
[2025-08-20 21:04:47,955][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:04:48,951][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:04:48,952][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:04:50,306][__main__][INFO] - Iteration 833 took 56s (39.16% Gen, 60.84% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 47m 9s. Estimated total time: 15h 41m 28s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 8s, 500 more iterations: 7h 50m 44s.
[2025-08-20 21:04:50,308][__main__][INFO] - Starting iteration 833.
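Editor's note: "Sharing advantage alignment data" / "Receiving advantage packets" mark the exchange between the two agents' trainers: each sends its per-step advantage estimates so the other can form the opponent-shaping term of its policy-gradient weight. A schematic of one common form of that advantage-alignment weighting (own advantage plus a discounted sum of past own advantages multiplied by the opponent's current advantage); the trainer's exact formula and coefficients are not shown in the log, so treat this purely as an illustration:

```python
import torch

def ad_align_weights(a_self, a_opp, gamma=0.96, beta=1.0):
    """Schematic advantage-alignment weights for one trajectory.

    a_self, a_opp: shape [T] per-step advantage estimates; a_opp is the
    "advantage packet" received from the other agent's trainer.
    """
    T = a_self.shape[0]
    past = torch.zeros(T)
    run = 0.0
    for t in range(T):
        past[t] = run                 # discounted sum of a_self[0..t-1]
        run = gamma * run + float(a_self[t])
    # base REINFORCE term plus the opponent-shaping alignment term
    return a_self + beta * past * a_opp
```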
[2025-08-20 21:05:14,627][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:05:14,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:05:14,635][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:05:17,105][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:05:17,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:05:17,113][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:05:17,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:05:17,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:05:17,415 … 21:05:42,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 micro-batches, 286,720 tokens)
[2025-08-20 21:05:44,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:05:44,968][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:05:44,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:05:47,338][__main__][INFO] - Iteration 834 took 57s (38.33% Gen, 61.67% Train). Generation: 21s, Training: 35s. Estimated remaining time: 2h 55m 13s. Estimated total time: 15h 50m 29s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 2s, 500 more iterations: 7h 55m 14s.
[2025-08-20 21:05:47,340][__main__][INFO] - Starting iteration 834.
[2025-08-20 21:06:11,626][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:06:11,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:06:11,633][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:06:14,093][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:06:14,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:06:14,101][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:06:14,103][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:06:14,103][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:06:14,403 … 21:06:39,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 micro-batches, 286,720 tokens)
[2025-08-20 21:06:41,096][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:06:42,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:06:42,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:06:43,408][__main__][INFO] - Iteration 835 took 56s (38.90% Gen, 61.10% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 38m 15s. Estimated total time: 15h 34m 28s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 26s, 500 more iterations: 7h 47m 14s.
[2025-08-20 21:06:43,410][__main__][INFO] - Starting iteration 835.
[2025-08-20 21:07:07,835][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:07:07,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:07:07,843][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:07:10,298][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:07:10,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:07:10,306][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:07:10,308][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:07:10,309][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:07:10,608 … 21:07:35,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 micro-batches, 286,720 tokens)
[2025-08-20 21:07:37,397][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:07:38,355][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:07:38,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:07:39,609][__main__][INFO] - Iteration 836 took 56s (39.08% Gen, 60.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 39m 30s. Estimated total time: 15h 36m 38s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 19s.
[2025-08-20 21:07:39,610][__main__][INFO] - Starting iteration 836.
[2025-08-20 21:08:03,982][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:08:03,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:08:03,989][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:08:06,441][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:08:06,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:08:06,449][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:08:06,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:08:06,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:08:06,757 … 21:08:31,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (×32 micro-batches, 286,720 tokens)
[2025-08-20 21:08:33,377][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:08:34,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:08:34,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:08:35,735][__main__][INFO] - Iteration 837 took 56s (39.05% Gen, 60.95% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 37m 19s. Estimated total time: 15h 35m 23s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 32s, 500 more iterations: 7h 47m 41s.
[2025-08-20 21:08:35,737][__main__][INFO] - Starting iteration 837.
[2025-08-20 21:09:00,169][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:00,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:00,177][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:02,646][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:02,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:02,653][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:02,656][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:09:02,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:09:02,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... the previous message repeated 31 more times, last at 21:09:28,081 ...]
[2025-08-20 21:09:29,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:09:30,545][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:09:30,547][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:09:31,940][__main__][INFO] - Iteration 838 took 56s (39.11% Gen, 60.89% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 37m 41s. Estimated total time: 15h 36m 42s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 40s, 500 more iterations: 7h 48m 21s.
[2025-08-20 21:09:31,941][__main__][INFO] - Starting iteration 838.
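The "For task: ... ΔVRAM ... Percentage of VRAM taken" records suggest a small profiling wrapper around each sub-task. A hedged sketch of such a probe built on PyTorch's public memory APIs (`torch.cuda.memory_allocated`, `torch.cuda.memory_reserved`, `torch.cuda.mem_get_info`); the actual trainer_ad_align helper is not visible in this log, so names and formatting here are illustrative only:

    import time
    from contextlib import contextmanager

    import torch

    GB = 1024 ** 3

    @contextmanager
    def log_task(name):
        # Snapshot VRAM counters and wall-clock before the task ...
        t0 = time.monotonic()
        alloc0, reserved0 = torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
        yield
        # ... and report the deltas afterwards, plus overall device usage.
        dt = time.monotonic() - t0
        d_alloc = (torch.cuda.memory_allocated() - alloc0) / GB
        d_res = (torch.cuda.memory_reserved() - reserved0) / GB
        free, total = torch.cuda.mem_get_info()
        pct = 100.0 * (total - free) / total
        print(f"For task: {name} , ΔVRAM Allocated: {d_alloc} GB, "
              f"ΔVRAM Reserved: {d_res} GB, ΔTime: {dt:.0f}s, "
              f"Percentage of VRAM taken: {pct}%,")

    if torch.cuda.is_available():  # the probe is only meaningful on a CUDA device
        with log_task("Create alternative trajectory batch"):
            batch = torch.randn(1024, 1024, device="cuda")

A 0.0 GB delta, as in the records above, means the task caused no net change in allocated memory and the caching allocator did not have to grow its reservation.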
[2025-08-20 21:09:56,802][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:56,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:56,810][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:59,320][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:59,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:59,329][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:09:59,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:09:59,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:09:59,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... the previous message repeated 31 more times, last at 21:10:24,771 ...]
[2025-08-20 21:10:26,370][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:10:27,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:10:27,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:10:28,824][__main__][INFO] - Iteration 839 took 56s (39.32% Gen, 60.68% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 48m 4s. Estimated total time: 15h 48m 2s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 48s, 500 more iterations: 7h 54m 1s.
[2025-08-20 21:10:28,826][__main__][INFO] - Starting iteration 839.
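Each iteration summary extrapolates wall-clock estimates linearly from the measured per-iteration duration: at roughly 56 s per iteration, "10 more iterations" comes out near 9m 20s and "500 more iterations" near 7h 48m, the scale reported throughout this run. A minimal sketch of that arithmetic, assuming a plain running average and, purely for illustration, a 1000-iteration budget (the trainer's exact smoothing and iteration count are not visible in the log):

    def format_hms(seconds: float) -> str:
        s = int(round(seconds))
        h, s = divmod(s, 3600)
        m, s = divmod(s, 60)
        return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

    def eta(avg_iter_seconds: float, done: int, total: int) -> dict:
        # Linear extrapolation from the average iteration time.
        return {
            "remaining": format_hms(avg_iter_seconds * (total - done)),
            "total": format_hms(avg_iter_seconds * total),
            "10 more": format_hms(avg_iter_seconds * 10),
            "100 more": format_hms(avg_iter_seconds * 100),
            "500 more": format_hms(avg_iter_seconds * 500),
        }

    # e.g. ~56.2 s/iteration with 839 of a hypothetical 1000 iterations done:
    # {'remaining': '2h 30m 48s', 'total': '15h 36m 40s', '10 more': '9m 22s', ...}
    print(eta(56.2, 839, 1000))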
[2025-08-20 21:10:53,192][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:10:53,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:10:53,199][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:10:55,658][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:10:55,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:10:55,666][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:10:55,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:10:55,669][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:10:55,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... the previous message repeated 31 more times, last at 21:11:21,095 ...]
[2025-08-20 21:11:22,664][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:11:23,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:11:23,665][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:11:25,086][__main__][INFO] - Iteration 840 took 56s (38.95% Gen, 61.05% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 36m 46s. Estimated total time: 15h 37m 40s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 50s.
[2025-08-20 21:11:25,088][__main__][INFO] - Starting iteration 840.
[2025-08-20 21:11:49,572][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:11:49,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:11:49,579][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:11:52,029][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:11:52,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:11:52,037][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:11:52,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:11:52,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:11:52,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... the previous message repeated 31 more times, last at 21:12:16,969 ...]
[2025-08-20 21:12:18,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:12:19,555][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:12:19,556][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:12:20,815][__main__][INFO] - Iteration 841 took 55s (39.56% Gen, 60.44% Train). Generation: 22s, Training: 33s. Estimated remaining time: 2h 26m 57s. Estimated total time: 15h 28m 46s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 52s, 500 more iterations: 7h 44m 23s.
[2025-08-20 21:12:20,818][__main__][INFO] - Starting iteration 841.
[2025-08-20 21:12:45,319][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:12:45,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:12:45,326][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:12:47,774][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:12:47,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:12:47,782][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:12:47,784][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:12:47,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:12:48,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... the previous message repeated 31 more times, last at 21:13:12,655 ...]
[2025-08-20 21:13:14,352][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:13:15,285][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:13:15,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:13:16,629][__main__][INFO] - Iteration 842 took 55s (39.50% Gen, 60.50% Train). Generation: 22s, Training: 33s. Estimated remaining time: 2h 27m 25s. Estimated total time: 15h 30m 11s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 1s, 500 more iterations: 7h 45m 5s.
[2025-08-20 21:13:16,630][__main__][INFO] - Starting iteration 842.
[2025-08-20 21:13:41,085][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:13:41,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:13:41,092][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:13:43,566][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:13:43,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:13:43,574][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:13:43,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:13:43,577][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:13:43,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... the previous message repeated 31 more times, last at 21:14:08,979 ...]
[2025-08-20 21:14:10,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:14:11,539][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:14:11,540][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:14:12,948][__main__][INFO] - Iteration 843 took 56s (39.03% Gen, 60.97% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 34m 55s. Estimated total time: 15h 38m 37s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 51s, 500 more iterations: 7h 49m 18s.
[2025-08-20 21:14:12,950][__main__][INFO] - Starting iteration 843.
[2025-08-20 21:14:37,739][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:14:37,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:14:37,746][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:14:40,192][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:14:40,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:14:40,199][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:14:40,201][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:14:40,202][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:14:40,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... the previous message repeated 31 more times, last at 21:15:05,624 ...]
[2025-08-20 21:15:07,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:15:08,151][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:15:08,152][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:15:09,446][__main__][INFO] - Iteration 844 took 56s (39.55% Gen, 60.44% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 36m 57s. Estimated total time: 15h 41m 35s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 9s, 500 more iterations: 7h 50m 47s.
[2025-08-20 21:15:09,447][__main__][INFO] - Starting iteration 844.
[2025-08-20 21:15:33,857][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:15:33,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:15:33,864][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:15:36,294][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:15:36,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:15:36,302][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:15:36,304][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:15:36,304][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:15:36,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... the previous message repeated 31 more times, last at 21:16:01,733 ...]
[2025-08-20 21:16:03,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:16:04,324][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:16:04,326][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:16:05,686][__main__][INFO] - Iteration 845 took 56s (39.09% Gen, 60.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 31m 43s. Estimated total time: 15h 37m 18s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 43s, 500 more iterations: 7h 48m 39s.
[2025-08-20 21:16:05,687][__main__][INFO] - Starting iteration 845.
[2025-08-20 21:16:30,181][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:16:30,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:16:30,188][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:16:32,621][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:16:32,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:16:32,629][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:16:32,631][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:16:32,632][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:16:32,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... the previous message repeated 31 more times, last at 21:16:58,026 ...]
[2025-08-20 21:16:59,622][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:17:00,581][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:17:00,582][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:17:01,942][__main__][INFO] - Iteration 846 took 56s (39.19% Gen, 60.80% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 31m 3s. Estimated total time: 15h 37m 34s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 45s, 500 more iterations: 7h 48m 47s.
[2025-08-20 21:17:01,943][__main__][INFO] - Starting iteration 846.
[2025-08-20 21:17:26,342][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:17:26,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:17:26,350][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:17:28,808][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:17:28,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:17:28,815][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:17:28,817][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:17:28,818][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:17:29,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... previous message repeated 31 more times, 2025-08-20 21:17:29,913 through 2025-08-20 21:17:54,279 ...]
[2025-08-20 21:17:55,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:17:56,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:17:56,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:17:58,041][__main__][INFO] - Iteration 847 took 56s (39.13% Gen, 60.87% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 27m 30s. Estimated total time: 15h 34m 57s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 29s, 500 more iterations: 7h 47m 28s.
[2025-08-20 21:17:58,042][__main__][INFO] - Starting iteration 847.
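Each iteration accumulates the policy gradient over 32 microbatches of 8960 tokens before the single "Apply reinforce step" update. A sketch of that accumulate-then-step pattern follows; reinforce_update, the batch fields, and the gradient clipping are illustrative assumptions, not mllm's actual API.

import torch

def reinforce_update(policy, optimizer, microbatches, logger):
    """Accumulate REINFORCE gradients over all microbatches, then step once."""
    optimizer.zero_grad(set_to_none=True)
    for batch in microbatches:  # 32 microbatches per iteration in this run
        logprobs = policy(batch.tokens)  # per-token log-probs, shape (B, T)
        # Policy-gradient loss: -advantage * log pi, masked to action tokens.
        loss = -(batch.advantages * logprobs * batch.action_mask).sum()
        (loss / len(microbatches)).backward()  # accumulate into .grad
        n_tokens = int(batch.action_mask.sum())  # 8960 in the lines above
        logger.info(f"Accumulated the policy gradient loss for {n_tokens} tokens.")
    torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)  # clipping assumed
    optimizer.step()  # the single "Apply reinforce step" per iteration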
[2025-08-20 21:18:22,454][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:18:22,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:18:22,462][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:18:24,883][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:18:24,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:18:24,890][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:18:24,893][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:18:24,893][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:18:25,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... previous message repeated 31 more times, 2025-08-20 21:18:25,978 through 2025-08-20 21:18:50,252 ...]
[2025-08-20 21:18:51,920][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:18:52,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:18:52,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:18:54,290][__main__][INFO] - Iteration 848 took 56s (39.09% Gen, 60.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 29m 4s. Estimated total time: 15h 37m 27s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 44s, 500 more iterations: 7h 48m 43s.
[2025-08-20 21:18:54,292][__main__][INFO] - Starting iteration 848.
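"Sharing advantage alignment data" followed by "Receiving advantage packets" suggests the two agents' trainers exchange per-step advantage estimates so that each policy gradient can be weighted by the other agent's advantages. A speculative sketch using an in-process queue; the real transport, packet format, and alignment formula are not visible in this log.

from dataclasses import dataclass
from queue import Queue

import torch

@dataclass
class AdvantagePacket:
    episode_ids: list          # which trajectories the advantages belong to
    advantages: torch.Tensor   # per-step advantage estimates, shape (B, T)

def share_and_receive(own: AdvantagePacket,
                      outbox: Queue, inbox: Queue) -> AdvantagePacket:
    outbox.put(own)       # "Sharing advantage alignment data."
    other = inbox.get()   # "Receiving advantage packets."
    assert other.episode_ids == own.episode_ids, "packets must cover same episodes"
    return other

# One plausible use of the received packet (illustrative only): weight this
# agent's advantages by the running sum of the opponent's, in the spirit of
# advantage-alignment objectives:
# aligned = own.advantages * torch.clamp(other.advantages.cumsum(dim=1), min=0.0)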
[2025-08-20 21:19:19,024][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:19:19,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:19:19,031][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:19:21,491][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:19:21,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:19:21,498][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:19:21,500][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:19:21,501][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:19:21,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... previous message repeated 31 more times, 2025-08-20 21:19:22,596 through 2025-08-20 21:19:46,911 ...]
[2025-08-20 21:19:48,533][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:19:49,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:19:49,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:19:50,845][__main__][INFO] - Iteration 849 took 56s (39.40% Gen, 60.60% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 33m 12s. Estimated total time: 15h 42m 32s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 15s, 500 more iterations: 7h 51m 16s.
[2025-08-20 21:19:50,846][__main__][INFO] - Starting iteration 849.
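Both optimizer states are checkpointed to the agent_trainer directory at the end of every iteration. A sketch of that step built on plain torch.save; the helper itself is an assumption, but the file names match the paths in the log.

import os
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
    """Persist both optimizer states, as logged at the end of each iteration."""
    os.makedirs(out_dir, exist_ok=True)
    for label, opt, fname in (
        ("main", policy_opt, "policy_optimizer_state.pt"),
        ("critic", critic_opt, "critic_optimizer_state.pt"),
    ):
        path = os.path.join(out_dir, fname)
        torch.save(opt.state_dict(), path)
        logger.info(f"Saved {label} optimizer state to {path}")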
[2025-08-20 21:20:15,250][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:20:15,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:20:15,258][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:20:17,722][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:20:17,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:20:17,730][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:20:17,732][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:20:17,733][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:20:18,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... previous message repeated 31 more times, 2025-08-20 21:20:18,823 through 2025-08-20 21:20:43,235 ...]
[2025-08-20 21:20:44,892][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:20:45,885][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:20:45,887][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:20:47,117][__main__][INFO] - Iteration 850 took 56s (39.01% Gen, 60.98% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 27m 34s. Estimated total time: 15h 37m 50s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 55s.
[2025-08-20 21:20:47,118][__main__][INFO] - Starting iteration 850.
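The per-iteration summary line combines a generation/training split with extrapolated time estimates: at roughly 56 s per iteration, 10/100/500 more iterations come to about 9m 22s, 1h 34m, and 7h 48m, matching the figures above. A sketch of that arithmetic; the helper is hypothetical, total_iters is inferred (the ~15.6 h estimated total divided by ~56 s suggests a run of about 1000 iterations), and the run's own estimator evidently averages slightly differently, since its remaining-time figures imply a marginally larger mean iteration time.

import datetime

def fmt(seconds: float) -> str:
    return str(datetime.timedelta(seconds=round(seconds)))

def iteration_summary(i, gen_s, train_s, total_iters, avg_iter_s):
    iter_s = gen_s + train_s
    print(f"Iteration {i} took {round(iter_s)}s "
          f"({100 * gen_s / iter_s:.2f}% Gen, {100 * train_s / iter_s:.2f}% Train).")
    print(f"Estimated remaining time: {fmt((total_iters - i) * avg_iter_s)}")
    for n in (10, 100, 500):
        print(f"{n} more iterations: {fmt(n * avg_iter_s)}")

iteration_summary(850, gen_s=21.9, train_s=34.3, total_iters=1000, avg_iter_s=56.2)
# -> 10 more: 0:09:22, 100 more: 1:33:40, 500 more: 7:48:20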
[2025-08-20 21:21:11,528][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:21:11,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:21:11,536][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:21:13,978][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:21:13,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:21:13,986][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:21:13,988][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:21:13,989][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:21:14,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... previous message repeated 31 more times, 2025-08-20 21:21:15,080 through 2025-08-20 21:21:39,381 ...]
[2025-08-20 21:21:41,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:21:42,001][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:21:42,002][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:21:45,818][__main__][INFO] - Iteration 851 took 58s (37.43% Gen, 58.28% Train). Generation: 21s, Training: 34s. Estimated remaining time: 3h 7m 4s. Estimated total time: 16h 18m 19s. Time estimates for 10 more iterations: 9m 46s, 100 more iterations: 1h 37m 49s, 500 more iterations: 8h 9m 9s.
[2025-08-20 21:21:45,820][__main__][INFO] - Starting iteration 851.
[2025-08-20 21:22:10,133][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:22:10,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:22:10,141][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:22:12,607][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:22:12,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:22:12,615][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:22:12,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:22:12,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:22:12,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... previous message repeated 31 more times, 2025-08-20 21:22:13,704 through 2025-08-20 21:22:37,989 ...]
[2025-08-20 21:22:39,602][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:22:40,533][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:22:40,534][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:22:42,051][__main__][INFO] - Iteration 852 took 56s (38.87% Gen, 61.13% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 24m 59s. Estimated total time: 15h 37m 10s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 43s, 500 more iterations: 7h 48m 35s.
[2025-08-20 21:22:42,052][__main__][INFO] - Starting iteration 852.
[2025-08-20 21:23:06,285][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:23:06,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:23:06,293][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:23:08,737][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:23:08,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:23:08,744][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:23:08,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:23:08,747][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:23:09,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... previous message repeated 31 more times, 2025-08-20 21:23:09,837 through 2025-08-20 21:23:34,207 ...]
[2025-08-20 21:23:35,856][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:23:36,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:23:36,806][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:23:38,172][__main__][INFO] - Iteration 853 took 56s (38.83% Gen, 61.17% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 22m 12s. Estimated total time: 15h 35m 19s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 31s, 500 more iterations: 7h 47m 39s.
[2025-08-20 21:23:38,174][__main__][INFO] - Starting iteration 853.
[2025-08-20 21:24:02,827][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:24:02,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:24:02,835][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:24:05,275][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:24:05,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:24:05,282][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:24:05,284][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:24:05,285][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:24:05,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... previous message repeated 31 more times, 2025-08-20 21:24:06,370 through 2025-08-20 21:24:30,713 ...]
[2025-08-20 21:24:32,326][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:24:33,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:24:33,354][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:24:34,686][__main__][INFO] - Iteration 854 took 56s (39.30% Gen, 60.70% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 27m 47s. Estimated total time: 15h 41m 51s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 11s, 500 more iterations: 7h 50m 55s.
[2025-08-20 21:24:34,687][__main__][INFO] - Starting iteration 854.
[2025-08-20 21:24:58,953][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:24:58,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:24:58,961][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:01,411][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:01,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:01,419][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:01,421][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:25:01,422][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:25:01,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[... previous message repeated 31 more times, 2025-08-20 21:25:02,514 through 2025-08-20 21:25:26,842 ...]
[2025-08-20 21:25:28,444][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:25:29,375][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:25:29,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:25:30,694][__main__][INFO] - Iteration 855 took 56s (38.95% Gen, 61.05% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 18m 26s. Estimated total time: 15h 33m 26s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 20s, 500 more iterations: 7h 46m 43s.
[2025-08-20 21:25:30,695][__main__][INFO] - Starting iteration 855.
[2025-08-20 21:25:54,997][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:54,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:55,004][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:57,451][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:57,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:57,459][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:25:57,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:25:57,462][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:25:57,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:25:58,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:25:59,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:00,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:00,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:01,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:02,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:03,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:04,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:04,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:05,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:06,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:07,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:08,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:08,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:09,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:10,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:11,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:12,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:12,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:13,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:14,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:15,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:16,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:16,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:17,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:18,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:19,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:20,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:20,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:21,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:22,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:24,519][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:26:25,422][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:26:25,423][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:26:26,728][__main__][INFO] - Iteration 856 took 56s (38.99% Gen, 61.01% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 17m 56s. Estimated total time: 15h 33m 52s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 23s, 500 more iterations: 7h 46m 56s.
[2025-08-20 21:26:26,729][__main__][INFO] - Starting iteration 856.
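Every iteration logs exactly 32 "Accumulated the policy gradient loss for 8960 tokens." records followed by one "Apply reinforce step", i.e. gradient accumulation over 32 fixed-size microbatches per optimizer update (32 × 8,960 = 286,720 tokens per step). A sketch of that pattern follows, assuming per-token log-probabilities, advantages, and a validity mask; policy, the microbatch fields, and reinforce_step itself are illustrative names, not the project's actual trainer code.

import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def reinforce_step(policy, optimizer, microbatches):
    # Accumulate the REINFORCE loss over all microbatches, then take a
    # single optimizer step (the "Apply reinforce step" record in the log).
    optimizer.zero_grad(set_to_none=True)
    for mb in microbatches:  # 32 microbatches of 8,960 tokens in this run
        logp = policy(mb["tokens"])  # per-token log-probabilities, (batch, seq)
        n_tokens = int(mb["mask"].sum())
        # Advantage-weighted negative log-likelihood, normalised per token
        # so every microbatch contributes with the same weight.
        loss = -(mb["advantage"] * logp * mb["mask"]).sum() / n_tokens
        (loss / len(microbatches)).backward()  # accumulate gradients in place
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    n_tokens)
    optimizer.step()

Accumulating before stepping is what keeps the per-step VRAM delta near zero here: only one microbatch's activations are alive at a time, while the effective batch stays at 286,720 tokens.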
[2025-08-20 21:26:50,894][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:26:50,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:26:50,902][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:26:53,364][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:26:53,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:26:53,372][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:26:53,374][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:26:53,374][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:26:53,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:54,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:55,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:56,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:56,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:57,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:58,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:26:59,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:00,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:00,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:01,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:02,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:03,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:04,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:04,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:05,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:06,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:07,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:07,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:08,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:09,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:10,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:11,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:11,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:12,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:13,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:14,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:15,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:15,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:17,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:18,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:18,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:20,525][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:27:21,502][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:27:21,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:27:22,802][__main__][INFO] - Iteration 857 took 56s (38.75% Gen, 61.25% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 17m 40s. Estimated total time: 15h 34m 32s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 16s.
[2025-08-20 21:27:22,804][__main__][INFO] - Starting iteration 857.
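After each update both optimizer states are rewritten to the same two files under seed_1000/agent_trainer/, so a crash or preemption loses at most one iteration. A minimal sketch of that checkpointing, assuming torch.optim optimizers; the helper name and the save_dir argument are assumptions:

import logging
import os

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def save_optimizer_states(policy_optimizer, critic_optimizer, save_dir):
    # Overwrite-in-place checkpointing: one file per optimizer, refreshed
    # after every reinforce step.
    os.makedirs(save_dir, exist_ok=True)
    policy_path = os.path.join(save_dir, "policy_optimizer_state.pt")
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    critic_path = os.path.join(save_dir, "critic_optimizer_state.pt")
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)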
[2025-08-20 21:27:46,986][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:27:46,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:27:46,994][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:27:49,447][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:27:49,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:27:49,454][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:27:49,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:27:49,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:27:49,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:50,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:51,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:52,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:52,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:53,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:54,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:55,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:56,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:56,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:57,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:58,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:27:59,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:00,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:00,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:01,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:02,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:03,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:04,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:04,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:05,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:06,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:07,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:08,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:08,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:10,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:10,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:11,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:12,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:13,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:14,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:14,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:16,401][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:28:17,283][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:28:17,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:28:18,662][__main__][INFO] - Iteration 858 took 55s (38.93% Gen, 61.07% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 13m 10s. Estimated total time: 15h 30m 57s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 5s, 500 more iterations: 7h 45m 28s.
[2025-08-20 21:28:18,663][__main__][INFO] - Starting iteration 858.
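The per-iteration summary line is simple bookkeeping: generation and training shares of the iteration's wall-clock time, plus extrapolated remaining and total times. The estimates drift between iterations (e.g. 15h 30m vs 15h 45m total), which suggests an average over recent iterations rather than a global mean; the sketch below uses a plain running mean and hypothetical names, so it reproduces the format of the line, not necessarily the trainer's exact smoothing.

def fmt_duration(seconds):
    # Render seconds in the log's "Hh Mm Ss" style, dropping zero hours.
    s = int(seconds)
    h, m = s // 3600, s % 3600 // 60
    return f"{h}h {m}m {s % 60}s" if h else f"{m}m {s % 60}s"

def iteration_summary(durations, gen_s, train_s, total_iters):
    it, took = len(durations), durations[-1]
    avg = sum(durations) / len(durations)  # the real trainer may window this
    print(
        f"Iteration {it} took {int(took)}s "
        f"({100 * gen_s / took:.2f}% Gen, {100 * train_s / took:.2f}% Train). "
        f"Generation: {int(gen_s)}s, Training: {int(train_s)}s. "
        f"Estimated remaining time: {fmt_duration(avg * (total_iters - it))}. "
        f"Estimated total time: {fmt_duration(avg * total_iters)}. "
        f"Time estimates for 10 more iterations: {fmt_duration(10 * avg)}, "
        f"100 more iterations: {fmt_duration(100 * avg)}, "
        f"500 more iterations: {fmt_duration(500 * avg)}."
    )

As a consistency check against the log: an average of 56s per iteration gives 10 × 56s = 560s = 9m 20s, matching the "10 more iterations" figure above.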
[2025-08-20 21:28:43,313][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:28:43,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:28:43,322][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:28:45,777][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:28:45,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:28:45,785][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:28:45,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:28:45,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:28:46,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:46,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:47,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:48,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:49,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:50,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:50,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:51,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:52,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:53,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:54,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:54,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:55,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:56,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:57,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:57,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:58,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:28:59,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:00,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:01,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:02,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:03,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:04,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:04,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:05,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:06,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:07,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:08,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:08,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:09,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:10,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:11,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:12,957][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:29:13,935][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:29:13,937][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:29:15,381][__main__][INFO] - Iteration 859 took 56s (39.15% Gen, 60.85% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 26m 33s. Estimated total time: 15h 45m 17s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 31s, 500 more iterations: 7h 52m 38s.
[2025-08-20 21:29:15,385][__main__][INFO] - Starting iteration 859.
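The "Sharing advantage alignment data." / "Receiving advantage packets." pair marks the point where the two trainers exchange advantage estimates, so that each trainer can align its policy-gradient update with the other agent's advantages. The log does not reveal the transport or the packet layout; the sketch below is purely illustrative, with both trainers in one process and every name (AdvantagePacket, AdvantageMailbox) invented for the example.

import logging
from dataclasses import dataclass

import torch

logger = logging.getLogger("mllm.training.trainer_ad_align")

@dataclass
class AdvantagePacket:
    trajectory_id: int
    advantages: torch.Tensor  # per-step advantage estimates for one trajectory

class AdvantageMailbox:
    # Toy in-process stand-in for the share/receive handshake in the log.
    def __init__(self):
        self._inbox = []

    def share(self, packets):
        logger.info("Sharing advantage alignment data.")
        self._inbox.extend(packets)

    def receive(self):
        logger.info("Receiving advantage packets.")
        packets, self._inbox = self._inbox, []
        return packets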
[2025-08-20 21:29:39,683][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:29:39,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:29:39,691][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:29:42,150][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:29:42,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:29:42,158][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:29:42,160][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:29:42,161][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:29:42,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:43,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:44,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:44,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:45,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:46,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:47,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:48,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:48,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:49,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:50,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:51,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:52,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:52,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:53,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:54,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:55,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:55,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:56,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:57,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:58,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:59,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:29:59,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:00,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:01,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:02,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:03,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:03,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:04,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:05,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:06,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:07,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:09,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:30:10,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:30:10,174][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:30:11,544][__main__][INFO] - Iteration 860 took 56s (38.88% Gen, 61.11% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 16m 12s. Estimated total time: 15h 35m 52s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 56s.
[2025-08-20 21:30:11,545][__main__][INFO] - Starting iteration 860.
[2025-08-20 21:30:35,851][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:30:35,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:30:35,858][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:30:38,300][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:30:38,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:30:38,307][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:30:38,310][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:30:38,310][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:30:38,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:39,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:40,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:40,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:41,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:42,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:43,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:44,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:44,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:45,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:46,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:47,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:48,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:48,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:49,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:50,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:51,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:52,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:52,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:53,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:54,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:55,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:56,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:56,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:57,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:58,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:30:59,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:00,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:00,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:02,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:02,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:03,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:05,231][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:31:06,191][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:31:06,193][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:31:07,758][__main__][INFO] - Iteration 861 took 56s (38.89% Gen, 61.11% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 16m 15s. Estimated total time: 15h 36m 51s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 41s, 500 more iterations: 7h 48m 25s.
[2025-08-20 21:31:07,759][__main__][INFO] - Starting iteration 861.
[2025-08-20 21:31:32,119][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:31:32,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:31:32,127][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:31:34,578][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:31:34,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:31:34,586][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:31:34,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:31:34,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:31:34,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:35,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:36,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:37,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:38,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:38,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:39,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:40,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:41,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:42,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:42,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:43,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:44,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:45,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:46,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:46,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:47,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:48,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:49,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:50,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:50,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:51,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:52,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:53,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:53,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:54,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:56,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:56,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:57,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:58,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:31:59,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:00,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:01,608][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:32:02,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:32:02,520][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:32:03,881][__main__][INFO] - Iteration 862 took 56s (39.03% Gen, 60.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 13m 48s. Estimated total time: 15h 35m 21s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 32s, 500 more iterations: 7h 47m 40s.
[2025-08-20 21:32:03,882][__main__][INFO] - Starting iteration 862.
[2025-08-20 21:32:28,001][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:32:28,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:32:28,009][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:32:30,443][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:32:30,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:32:30,451][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:32:30,453][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:32:30,454][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:32:30,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:31,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:32,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:33,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:33,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:34,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:35,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:36,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:37,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:37,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:38,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:39,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:40,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:41,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:41,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:42,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:43,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:44,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:45,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:45,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:46,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:47,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:48,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:49,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:50,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:51,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:51,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:52,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:53,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:54,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:55,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:55,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:32:57,449][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:32:58,348][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:32:58,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:32:59,659][__main__][INFO] - Iteration 863 took 55s (38.90% Gen, 61.10% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 7m 7s. Estimated total time: 15h 29m 36s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 57s, 500 more iterations: 7h 44m 48s.
[2025-08-20 21:32:59,660][__main__][INFO] - Starting iteration 863.
[2025-08-20 21:33:24,284][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:33:24,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:33:24,292][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:33:26,748][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:33:26,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:33:26,756][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:33:26,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:33:26,759][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:33:27,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:27,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:28,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:29,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:30,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:31,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:31,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:32,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:33,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:34,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:35,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:35,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:36,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:37,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:38,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:38,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:39,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:40,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:41,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:42,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:42,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:43,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:44,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:45,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:46,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:46,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:47,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:48,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:49,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:50,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:51,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:52,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:33:53,785][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 21:33:54,690][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 21:33:54,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 21:33:56,177][__main__][INFO] - Iteration 864 took 56s (39.22% Gen, 60.78% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 18m 31s. Estimated total time: 15h 41m 56s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 11s, 500 more iterations: 7h 50m 58s.
[2025-08-20 21:33:56,178][__main__][INFO] - Starting iteration 864.
[2025-08-20 21:34:20,557][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:34:20,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:34:20,564][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:34:23,020][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:34:23,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:34:23,027][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:34:23,030][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:34:23,030][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:34:23,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:24,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:24,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:25,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:26,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:27,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:28,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:28,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:29,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:30,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:31,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:32,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:32,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:33,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:34,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:35,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:34:36,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:36,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:37,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:38,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:39,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:40,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:41,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:42,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:42,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:43,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:44,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:45,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:46,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:46,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:47,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:48,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:34:50,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:34:51,094][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:34:51,096][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:34:52,336][__main__][INFO] - Iteration 865 took 56s (39.03% Gen, 60.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 11m 36s. Estimated total time: 15h 35m 57s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 58s. [2025-08-20 21:34:52,340][__main__][INFO] - Starting iteration 865. 
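Each iteration opens with two passes of "Create alternative trajectory batch" followed by "Compute alternative advantage estimates", which suggests the trainer re-scores counterfactual versions of the sampled trajectories. How those batches are built is not visible in the log; one plausible minimal sketch, where alternative action tokens are spliced into the logged sequences at this agent's positions before re-scoring (all field names are assumptions):

```python
import torch

def make_alternative_batch(batch: dict, alt_action_ids: torch.Tensor,
                           action_mask: torch.Tensor) -> dict:
    """Clone a logged trajectory batch and splice in counterfactual action tokens.

    Hypothetical helper: batch["input_ids"] holds the logged token sequences,
    action_mask marks the positions owned by this agent, and alt_action_ids
    supplies alternative tokens for exactly those positions.
    """
    alt = {k: v.clone() if torch.is_tensor(v) else v for k, v in batch.items()}
    alt["input_ids"] = torch.where(
        action_mask.bool(), alt_action_ids, alt["input_ids"]
    )
    return alt
```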
[2025-08-20 21:35:16,626][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:35:16,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:35:16,646][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:35:19,093][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:35:19,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:35:19,101][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:35:19,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:35:19,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:35:19,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:20,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:20,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:21,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:22,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:23,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:24,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:24,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:25,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:26,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:27,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:28,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:28,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:29,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:30,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:31,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:35:32,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:32,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:33,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:34,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:35,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:36,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:36,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:37,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:38,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:39,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:40,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:40,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:41,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:42,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:43,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:44,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:35:46,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:35:47,029][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:35:47,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:35:48,338][__main__][INFO] - Iteration 866 took 55s (39.00% Gen, 61.00% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 7m 59s. Estimated total time: 15h 33m 16s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 19s, 500 more iterations: 7h 46m 38s. [2025-08-20 21:35:48,339][__main__][INFO] - Starting iteration 866. 
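"Get advantages with critic gradient accumulation" pairs advantage estimation with accumulating the critic's own regression gradient in the same pass, which is why no separate critic update appears later in each iteration. A sketch under stated assumptions: the critic returns per-step values, rewards are per step, and a one-step TD estimator stands in for whatever estimator the trainer actually uses:

```python
import torch
import torch.nn.functional as F

def advantages_with_critic_accumulation(critic, batch, gamma: float = 0.99):
    """Estimate advantages and accumulate the critic's regression loss (sketch)."""
    values = critic(batch["input_ids"])                  # per-step V(s_t), [B, T]
    rewards = batch["rewards"]                           # per-step rewards, [B, T]
    next_values = torch.cat(
        [values[:, 1:], torch.zeros_like(values[:, :1])], dim=1
    )
    td_target = rewards + gamma * next_values.detach()
    advantages = (td_target - values).detach()           # A_t = r_t + γV(s_{t+1}) - V(s_t)
    critic_loss = F.mse_loss(values, td_target)
    critic_loss.backward()                               # accumulated; stepped later
    return advantages
```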
[2025-08-20 21:36:12,715][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:36:12,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:36:12,723][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:36:15,195][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:36:15,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:36:15,203][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:36:15,205][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:36:15,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:36:15,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:16,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:17,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:17,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:18,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:19,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:20,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:21,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:21,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:22,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:23,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:24,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:25,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:25,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:26,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:27,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:36:28,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:29,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:29,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:30,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:31,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:32,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:32,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:33,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:34,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:35,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:36,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:37,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:38,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:38,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:39,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:40,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:36:42,162][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:36:43,061][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:36:43,063][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:36:44,478][__main__][INFO] - Iteration 867 took 56s (39.03% Gen, 60.97% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 9m 25s. Estimated total time: 15h 35m 38s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 33s, 500 more iterations: 7h 47m 49s. [2025-08-20 21:36:44,480][__main__][INFO] - Starting iteration 867. 
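"Sharing advantage alignment data." followed immediately by "Receiving advantage packets." is the hand-off point where the two agents' trainers swap advantage estimates so each can shape its policy objective with the other's. The actual transport is not identifiable from the log; a minimal sketch using a pair of multiprocessing queues as a stand-in:

```python
import torch
import torch.multiprocessing as mp

def exchange_advantages(my_adv, send_q, recv_q):
    """Swap advantage tensors with the co-player's trainer process (illustrative)."""
    send_q.put(my_adv.detach().cpu())   # "Sharing advantage alignment data."
    other_adv = recv_q.get()            # "Receiving advantage packets."
    return other_adv.to(my_adv.device)

# Usage sketch: with q_ab, q_ba = mp.Queue(), mp.Queue(),
# agent A calls exchange_advantages(adv_a, q_ab, q_ba)
# while agent B calls exchange_advantages(adv_b, q_ba, q_ab).
```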
[2025-08-20 21:37:08,820][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:37:08,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:37:08,827][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:37:11,294][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:37:11,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:37:11,302][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:37:11,304][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:37:11,305][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:37:11,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:12,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:13,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:13,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:14,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:15,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:16,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:17,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:17,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:18,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:19,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:20,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:21,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:21,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:22,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:23,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:37:24,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:25,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:25,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:26,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:27,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:28,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:29,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:29,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:30,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:31,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:32,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:33,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:34,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:35,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:35,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:36,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:37:38,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:37:39,283][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:37:39,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:37:40,743][__main__][INFO] - Iteration 868 took 56s (38.87% Gen, 61.12% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 10m 32s. Estimated total time: 15h 37m 42s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 46s, 500 more iterations: 7h 48m 51s. [2025-08-20 21:37:40,744][__main__][INFO] - Starting iteration 868. 
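Each update then accumulates the policy-gradient loss over 32 micro-batches of 8960 action tokens apiece, at roughly 0.8 s per micro-batch. A sketch of that loop, assuming per-token weights have already been derived from the exchanged advantages; the field names and the token-count normalization are assumptions:

```python
import torch

def accumulate_policy_gradient(policy, micro_batches, logger):
    """REINFORCE-style loss accumulation across fixed-size micro-batches (sketch)."""
    n = len(micro_batches)  # 32 in the iterations above
    for mb in micro_batches:
        logits = policy(mb["input_ids"]).logits[:, :-1]
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(
            -1, mb["input_ids"][:, 1:].unsqueeze(-1)
        ).squeeze(-1)
        mask = mb["action_mask"][:, 1:].float()    # this agent's action tokens
        # pg_weights: per-token advantages, possibly reshaped by the alignment term
        loss = -(mb["pg_weights"][:, 1:] * token_logp * mask).sum() / mask.sum()
        (loss / n).backward()                      # gradients accumulate in .grad
        logger.info(
            "Accumulated the policy gradient loss for %d tokens.", int(mask.sum())
        )
```

With fixed-size padded micro-batches, mask.sum() is constant, which would account for the identical 8960-token count on every line.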
[2025-08-20 21:38:05,371][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:38:05,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:38:05,379][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:38:07,817][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:38:07,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:38:07,825][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:38:07,827][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:38:07,828][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:38:08,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:08,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:09,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:10,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:11,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:12,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:12,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:13,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:14,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:15,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:16,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:16,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:17,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:18,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:19,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:20,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:38:20,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:21,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:22,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:23,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:24,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:25,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:26,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:26,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:27,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:28,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:29,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:30,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:30,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:31,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:32,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:33,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:38:34,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:38:35,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:38:35,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:38:37,094][__main__][INFO] - Iteration 869 took 56s (39.36% Gen, 60.64% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 11m 3s. Estimated total time: 15h 39m 9s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 34s. [2025-08-20 21:38:37,096][__main__][INFO] - Starting iteration 869. 
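"Apply reinforce step" then consumes the accumulated gradients in a single optimizer step (the ~2.3e-05 GB allocation delta indicates the optimizer state is already materialized at this point in the run), after which both optimizer states are checkpointed. A sketch; the gradient clipping is an assumption, and the file names mirror the paths in the log:

```python
import os
import torch

def apply_reinforce_step(policy_opt, critic_opt, model, save_dir, logger):
    """Step both optimizers on the accumulated gradients, then checkpoint them."""
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # assumed
    policy_opt.step()
    critic_opt.step()
    policy_opt.zero_grad(set_to_none=True)
    critic_opt.zero_grad(set_to_none=True)
    for name, opt in (("policy", policy_opt), ("critic", critic_opt)):
        path = os.path.join(save_dir, f"{name}_optimizer_state.pt")
        torch.save(opt.state_dict(), path)
        logger.info("Saved %s optimizer state to %s", name, path)
```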
[2025-08-20 21:39:01,502][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:39:01,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:39:01,510][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:39:03,971][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:39:03,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:39:03,979][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:39:03,981][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:39:03,982][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:39:04,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:05,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:05,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:06,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:07,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:08,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:09,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:09,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:10,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:11,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:12,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:13,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:13,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:14,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:15,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:16,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:39:16,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:17,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:18,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:19,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:20,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:20,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:21,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:22,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:23,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:24,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:24,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:25,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:26,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:27,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:28,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:29,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:39:30,985][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:39:31,892][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:39:31,894][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:39:33,283][__main__][INFO] - Iteration 870 took 56s (39.04% Gen, 60.96% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 7m 24s. Estimated total time: 15h 36m 26s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 38s, 500 more iterations: 7h 48m 13s. [2025-08-20 21:39:33,285][__main__][INFO] - Starting iteration 870. 
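Each iteration closes with a timing summary whose extrapolations are internally consistent with a simple per-iteration average: at roughly 56 s per iteration, "10 more iterations: 9m 21s" is just 10 x 56.1 s, and the ~2h remaining at iteration 870 implies on the order of 130 to 140 iterations left. A sketch of the bookkeeping, assuming a running mean; the log's actual estimator is not visible:

```python
def fmt_hms(seconds: float) -> str:
    """Render seconds in the 'Xh Ym Zs' style used in the log."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

def timing_summary(it, gen_s, train_s, avg_iter_s, iters_left):
    """Format the per-iteration summary line (illustrative reconstruction)."""
    total = gen_s + train_s
    return (
        f"Iteration {it} took {round(total)}s "
        f"({100 * gen_s / total:.2f}% Gen, {100 * train_s / total:.2f}% Train). "
        f"Generation: {round(gen_s)}s, Training: {round(train_s)}s. "
        f"Estimated remaining time: {fmt_hms(iters_left * avg_iter_s)}. "
        f"Time estimates for 10 more iterations: {fmt_hms(10 * avg_iter_s)}, "
        f"100 more iterations: {fmt_hms(100 * avg_iter_s)}, "
        f"500 more iterations: {fmt_hms(500 * avg_iter_s)}."
    )
```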
[2025-08-20 21:39:57,665][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:39:57,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:39:57,673][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:00,131][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:00,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:00,139][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:00,141][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:40:00,142][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:40:00,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:01,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:02,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:02,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:03,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:04,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:05,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:05,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:06,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:07,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:08,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:09,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:09,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:10,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:11,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:12,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:40:13,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:13,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:14,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:15,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:16,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:17,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:17,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:18,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:19,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:20,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:21,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:21,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:23,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:24,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:24,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:25,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:27,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:40:28,148][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:40:28,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:40:30,548][__main__][INFO] - Iteration 871 took 57s (38.26% Gen, 61.74% Train). Generation: 21s, Training: 35s. Estimated remaining time: 2h 24m 23s. Estimated total time: 15h 54m 22s. Time estimates for 10 more iterations: 9m 32s, 100 more iterations: 1h 35m 26s, 500 more iterations: 7h 57m 11s. [2025-08-20 21:40:30,549][__main__][INFO] - Starting iteration 871. 
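Putting the pieces together, the stable ~39% Gen / ~61% Train split suggests a plain alternating outer loop: roll out negotiations with the current policy, then run the advantage-alignment training step on the fresh batch. A skeleton with all names illustrative, reusing timing_summary from the sketch above:

```python
import time

def run(env_rollout, train_step, start_iter, total_iters, logger):
    """Generate-then-train outer loop matching the log's cadence (sketch)."""
    for it in range(start_iter, total_iters):
        logger.info("Starting iteration %d.", it)
        t0 = time.monotonic()
        batch = env_rollout()    # ~21-22 s "Generation" share in the entries above
        t1 = time.monotonic()
        train_step(batch)        # ~34 s: alt. batches, advantages, exchange, step
        t2 = time.monotonic()
        # Completion lines are numbered one ahead of the "Starting iteration"
        # lines, matching the log; avg_iter_s here is the last iteration's time,
        # standing in for whatever running average the real loop keeps.
        logger.info(timing_summary(it + 1, t1 - t0, t2 - t1,
                                   avg_iter_s=t2 - t0,
                                   iters_left=total_iters - it - 1))
```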
[2025-08-20 21:40:55,121][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:55,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:55,128][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:57,571][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:57,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:57,579][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:40:57,581][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:40:57,581][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:40:57,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:58,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:40:59,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:00,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:01,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:01,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:02,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:03,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:04,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:05,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:05,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:06,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:07,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:08,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:08,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:09,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:41:10,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:11,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:12,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:12,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:13,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:14,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:15,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:16,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:17,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:18,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:18,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:19,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:20,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:21,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:22,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:22,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:24,544][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:41:25,506][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:41:25,507][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:41:26,854][__main__][INFO] - Iteration 872 took 56s (39.26% Gen, 60.73% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 7m 28s. Estimated total time: 15h 38m 24s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 12s. [2025-08-20 21:41:26,856][__main__][INFO] - Starting iteration 872. 
[2025-08-20 21:41:51,276][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:41:51,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:41:51,284][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:41:53,754][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:41:53,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:41:53,761][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:41:53,763][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:41:53,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:41:54,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:54,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:55,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:56,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:57,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:58,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:58,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:41:59,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:00,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:01,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:01,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:02,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:03,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:04,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:05,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:05,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:42:06,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:07,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:08,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:09,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:10,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:11,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:11,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:12,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:13,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:14,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:15,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:15,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:16,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:17,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:18,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:19,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:20,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:42:21,678][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:42:21,680][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:42:23,083][__main__][INFO] - Iteration 873 took 56s (39.06% Gen, 60.94% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 5m 15s. Estimated total time: 15h 37m 7s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 42s, 500 more iterations: 7h 48m 33s. [2025-08-20 21:42:23,085][__main__][INFO] - Starting iteration 873. 
[2025-08-20 21:42:47,476][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:42:47,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:42:47,483][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:42:49,963][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:42:49,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:42:49,971][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:42:49,973][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:42:49,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:42:50,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:51,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:51,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:52,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:53,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:54,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:55,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:55,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:56,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:57,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:58,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:58,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:42:59,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:00,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:01,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:02,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:43:02,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:03,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:04,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:05,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:06,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:06,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:07,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:08,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:09,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:10,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:11,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:12,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:13,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:13,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:14,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:15,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:17,045][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:43:18,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:43:18,020][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:43:19,377][__main__][INFO] - Iteration 874 took 56s (38.95% Gen, 61.05% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 5m 23s. Estimated total time: 15h 38m 11s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 49s, 500 more iterations: 7h 49m 5s. [2025-08-20 21:43:19,378][__main__][INFO] - Starting iteration 874. 
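Each iteration logs 32 consecutive "Accumulated the policy gradient loss for 8960 tokens" entries before the reinforce step, i.e. gradients are summed over 32 microbatches (about 286,720 action tokens) and applied in a single update. A minimal sketch of that loop, assuming a REINFORCE-style surrogate weighted by per-token advantages; `policy.logprobs`, `batch.input_ids`, and `batch.advantages` are hypothetical names, not confirmed by the log.

```python
import logging
import torch

logger = logging.getLogger("mllm.training.trainer_common")

def accumulate_policy_gradient(policy, microbatches):
    """Backward through each microbatch; .backward() sums gradients
    in-place, so one optimizer step later applies the full batch."""
    n = len(microbatches)
    for batch in microbatches:
        logprobs = policy.logprobs(batch.input_ids)      # hypothetical helper
        loss = -(batch.advantages * logprobs).sum() / n  # REINFORCE surrogate
        loss.backward()                                  # grads accumulate
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    batch.advantages.numel())
```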
[2025-08-20 21:43:44,091][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:43:44,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:43:44,099][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:43:46,584][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:43:46,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:43:46,592][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:43:46,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:43:46,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:43:46,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:47,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:48,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:49,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:50,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:50,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:51,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:52,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:53,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:54,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:54,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:55,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:56,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:57,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:58,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:43:58,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:43:59,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:00,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:01,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:01,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:02,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:03,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:04,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:05,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:05,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:06,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:07,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:08,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:09,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:10,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:11,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:11,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:13,532][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:44:14,514][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:44:14,516][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:44:15,893][__main__][INFO] - Iteration 875 took 56s (39.35% Gen, 60.65% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 8m 9s. Estimated total time: 15h 41m 54s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 11s, 500 more iterations: 7h 50m 57s. [2025-08-20 21:44:15,895][__main__][INFO] - Starting iteration 875. 
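After the 32 accumulation passes, "Apply reinforce step" performs the single optimizer update; the near-zero ΔVRAM Allocated (≈2.3e-05 GB) on these steps is consistent with the optimizer state already being allocated by this point in the run. A sketch of the step itself; the gradient clipping is an assumption, not something the log confirms.

```python
import torch

def apply_reinforce_step(policy, optimizer, max_grad_norm: float = 1.0):
    """One update over the gradients accumulated above."""
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)  # assumed
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)  # reset for the next accumulation
```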
[2025-08-20 21:44:40,276][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:44:40,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:44:40,283][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:44:42,724][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:44:42,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:44:42,732][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:44:42,734][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:44:42,735][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:44:43,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:43,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:44,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:45,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:46,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:46,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:47,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:48,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:49,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:50,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:50,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:51,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:52,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:53,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:54,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:54,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:44:55,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:56,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:57,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:58,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:58,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:44:59,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:00,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:01,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:02,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:02,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:03,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:04,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:05,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:06,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:07,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:08,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:09,815][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:45:10,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:45:10,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:45:12,270][__main__][INFO] - Iteration 876 took 56s (38.92% Gen, 61.08% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 4m 53s. Estimated total time: 15h 39m 34s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 57s, 500 more iterations: 7h 49m 47s. [2025-08-20 21:45:12,271][__main__][INFO] - Starting iteration 876. 
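The two "Saved ... optimizer state" lines point at serialized state dicts under `.../agent_trainer/`. Persisting them is presumably a plain `torch.save` over each optimizer's `state_dict()`, along these lines:

```python
import os
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir: str):
    # File names taken verbatim from the log; out_dir would be the
    # .../seed_1000/agent_trainer directory shown above.
    torch.save(policy_opt.state_dict(),
               os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(),
               os.path.join(out_dir, "critic_optimizer_state.pt"))
```

Writing both files every iteration keeps the run resumable after preemption at the cost of two small writes per step.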
[2025-08-20 21:45:36,674][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:45:36,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:45:36,681][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:45:39,129][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:45:39,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:45:39,136][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:45:39,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:45:39,139][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:45:39,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:40,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:41,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:41,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:42,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:43,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:44,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:44,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:45,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:46,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:47,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:48,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:48,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:49,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:50,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:51,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:45:52,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:52,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:53,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:54,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:55,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:56,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:56,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:57,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:58,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:45:59,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:00,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:01,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:02,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:03,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:03,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:04,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:06,264][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:46:07,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:46:07,310][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:46:08,686][__main__][INFO] - Iteration 877 took 56s (38.90% Gen, 61.09% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 4m 37s. Estimated total time: 15h 40m 15s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 1s, 500 more iterations: 7h 50m 7s. [2025-08-20 21:46:08,688][__main__][INFO] - Starting iteration 877. 
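The timing summaries are simple arithmetic over a running average of iteration wall time: at roughly 56.4 s/iteration, 10 iterations come to 9m 24s and 100 to about 1h 34m, matching the logged estimates. The exact averaging window used by `__main__` is not visible in the log; a sketch of the projection:

```python
from datetime import timedelta

def eta(avg_iter_seconds: float, iters_left: int) -> str:
    """Projected wall time for the remaining iterations."""
    return str(timedelta(seconds=round(avg_iter_seconds * iters_left)))

# With a running average of ~56.4 s/iteration:
#   eta(56.4, 10)  -> '0:09:24'
#   eta(56.4, 100) -> '1:34:00'
#   eta(56.4, 500) -> '7:50:00'
```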
[2025-08-20 21:46:33,099][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:46:33,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:46:33,107][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:46:35,537][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:46:35,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:46:35,545][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:46:35,547][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:46:35,548][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:46:35,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:36,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:37,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:38,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:39,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:39,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:40,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:41,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:42,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:42,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:43,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:44,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:45,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:46,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:46,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:47,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:46:48,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:49,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:50,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:50,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:51,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:52,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:53,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:54,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:54,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:55,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:56,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:57,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:58,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:46:59,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:00,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:00,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:02,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:47:03,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:47:03,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:47:04,860][__main__][INFO] - Iteration 878 took 56s (39.12% Gen, 60.88% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 59m 37s. Estimated total time: 15h 36m 11s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 37s, 500 more iterations: 7h 48m 5s. [2025-08-20 21:47:04,861][__main__][INFO] - Starting iteration 878. 
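The "Sharing advantage alignment data." / "Receiving advantage packets." pair implies the two co-trained agents exchange per-trajectory advantages so that each side's policy-gradient terms can be weighted by the opponent's advantages, which is the core of advantage alignment. The transport is not shown in the log; the queue-based stand-in below is purely illustrative, with all names hypothetical.

```python
import queue

# One queue per direction; in the real system this could instead be a
# socket, shared memory, or a distributed collective.
outbox: queue.Queue = queue.Queue()  # read by the other agent's trainer
inbox: queue.Queue = queue.Queue()   # written by the other agent's trainer

def exchange_advantages(own_advantages):
    outbox.put(own_advantages)   # "Sharing advantage alignment data."
    return inbox.get()           # "Receiving advantage packets."
```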
[2025-08-20 21:47:29,303][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:47:29,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:47:29,311][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:47:31,750][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:47:31,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:47:31,758][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:47:31,760][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:47:31,760][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:47:32,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:32,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:33,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:34,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:35,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:36,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:36,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:37,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:38,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:39,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:39,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:40,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:41,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:42,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:43,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:43,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:47:44,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:45,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:46,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:47,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:47,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:49,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:49,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:50,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:51,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:52,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:53,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:53,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:54,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:55,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:56,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:57,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:47:58,773][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:47:59,838][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:47:59,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:48:01,293][__main__][INFO] - Iteration 879 took 56s (38.97% Gen, 61.03% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 3m 0s. Estimated total time: 15h 40m 31s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 3s, 500 more iterations: 7h 50m 15s. [2025-08-20 21:48:01,294][__main__][INFO] - Starting iteration 879. 
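"Get advantages with critic gradient accumulation" reads as computing advantages A = R − V(s) while accumulating the critic's value-regression gradient in the same pass, so the critic adapter can be updated alongside the policy. A sketch under that interpretation; `critic`, `states`, and `returns` are placeholders, not names from the codebase.

```python
import torch
import torch.nn.functional as F

def advantages_with_critic_accumulation(critic, states, returns):
    """Score states with the critic, accumulate its regression gradient,
    and return detached advantages for the policy-gradient loss."""
    values = critic(states).squeeze(-1)
    F.mse_loss(values, returns).backward()  # critic grads accumulate here
    return (returns - values).detach()      # advantages A = R - V(s)
```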
[2025-08-20 21:48:25,651][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:48:25,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:48:25,659][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:48:28,117][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:48:28,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:48:28,125][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:48:28,127][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:48:28,128][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:48:28,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:29,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:30,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:30,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:31,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:32,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:33,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:33,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:34,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:35,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:36,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:37,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:37,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:38,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:39,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:40,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:48:41,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:41,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:42,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:43,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:44,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:45,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:45,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:46,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:47,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:48,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:49,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:49,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:51,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:51,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:52,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:53,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:48:55,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:48:56,145][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:48:56,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:48:57,454][__main__][INFO] - Iteration 880 took 56s (39.02% Gen, 60.98% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 57m 32s. Estimated total time: 15h 35m 59s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 59s. [2025-08-20 21:48:57,455][__main__][INFO] - Starting iteration 880. 
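The counterpart to the per-iteration checkpointing above is resumption: loading the saved state dicts back into freshly constructed optimizers before training continues. The log does not show the restore path, but it is presumably the mirror image of the save:

```python
import os
import torch

def load_optimizer_states(policy_opt, critic_opt, out_dir: str):
    policy_opt.load_state_dict(
        torch.load(os.path.join(out_dir, "policy_optimizer_state.pt")))
    critic_opt.load_state_dict(
        torch.load(os.path.join(out_dir, "critic_optimizer_state.pt")))
```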
[2025-08-20 21:49:22,225][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:49:22,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:49:22,233][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:49:24,706][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:49:24,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:49:24,714][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:49:24,716][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:49:24,716][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:49:25,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:25,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:26,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:27,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:28,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:28,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:29,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:30,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:31,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:32,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:32,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:33,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:34,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:35,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:36,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:36,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:49:37,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:38,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:39,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:40,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:40,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:42,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:42,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:43,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:44,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:45,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:46,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:46,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:47,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:48,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:49,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:50,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:49:51,657][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:49:52,616][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:49:52,617][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:49:54,003][__main__][INFO] - Iteration 881 took 56s (39.41% Gen, 60.59% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 3m 4s. Estimated total time: 15h 42m 26s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 14s, 500 more iterations: 7h 51m 13s. [2025-08-20 21:49:54,004][__main__][INFO] - Starting iteration 881. 
[2025-08-20 21:50:18,419][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:50:18,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:50:18,427][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:50:20,868][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:50:20,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:50:20,876][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:50:20,878][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:50:20,878][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:50:21,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:21,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:22,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:23,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:24,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:25,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:25,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:26,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:27,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:28,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:29,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:29,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:30,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:31,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:32,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:33,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:50:33,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:34,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:35,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:36,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:37,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:37,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:38,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:39,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:40,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:41,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:42,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:43,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:43,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:44,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:45,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:46,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:50:47,983][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:50:48,948][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:50:48,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:50:50,367][__main__][INFO] - Iteration 882 took 56s (39.01% Gen, 60.99% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 59m 2s. Estimated total time: 15h 39m 21s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 56s, 500 more iterations: 7h 49m 40s. [2025-08-20 21:50:50,368][__main__][INFO] - Starting iteration 882. 
[2025-08-20 21:51:14,765][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:51:14,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:51:14,783][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:51:17,275][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:51:17,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:51:17,283][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:51:17,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:51:17,285][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:51:17,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:18,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:19,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:19,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:20,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:21,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:22,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:23,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:23,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:24,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:25,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:26,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:27,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:27,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:28,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:29,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:51:30,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:31,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:31,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:32,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:33,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:34,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:35,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:35,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:36,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:37,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:38,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:39,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:39,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:41,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:41,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:42,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:51:44,399][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:51:45,560][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:51:45,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:51:46,952][__main__][INFO] - Iteration 883 took 56s (38.77% Gen, 61.23% Train). Generation: 21s, Training: 34s. Estimated remaining time: 2h 1m 48s. Estimated total time: 15h 43m 3s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 18s, 500 more iterations: 7h 51m 31s. [2025-08-20 21:51:46,954][__main__][INFO] - Starting iteration 883. 
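Each iteration logs 32 consecutive "Accumulated the policy gradient loss for 8960 tokens." entries followed by a single "Apply reinforce step", which is the standard gradient-accumulation pattern: backpropagate each micro-batch's scaled loss, then take one optimizer step. A minimal sketch under those assumptions; the micro-batch count and token count are read off the log, while `policy.log_probs` and the batch keys are hypothetical names:

```python
import torch

ACCUM_STEPS = 32            # micro-batches per iteration, per the log
TOKENS_PER_MICROBATCH = 8960


def reinforce_iteration(policy, optimizer, micro_batches, logger):
    """One REINFORCE update with gradient accumulation (illustrative sketch)."""
    optimizer.zero_grad(set_to_none=True)
    for batch in micro_batches:  # len(micro_batches) == ACCUM_STEPS
        # Policy-gradient loss: advantage-weighted negative log-probs,
        # normalized per token (hypothetical API and normalization).
        logp = policy.log_probs(batch["tokens"])
        loss = -(batch["advantages"] * logp).sum() / TOKENS_PER_MICROBATCH
        (loss / ACCUM_STEPS).backward()  # accumulate gradients across micro-batches
        logger.info(
            "Accumulated the policy gradient loss for %d tokens.",
            batch["tokens"].numel(),
        )
    optimizer.step()  # the single "Apply reinforce step" in the log
```

Dividing by `ACCUM_STEPS` before `backward()` keeps the accumulated gradient equal to the mean over the full 32 × 8960-token batch, which is why only the final step mutates weights and optimizer state.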
[2025-08-20 21:52:11,328][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:52:11,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:52:11,335][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:52:13,791][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:52:13,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:52:13,799][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:52:13,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:52:13,802][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:52:14,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:14,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:15,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:16,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:17,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:18,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:18,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:19,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:20,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:21,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:22,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:22,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:23,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:24,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:25,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:26,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:52:26,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:27,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:28,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:29,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:29,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:30,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:31,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:32,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:33,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:33,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:35,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:36,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:36,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:37,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:38,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:39,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:52:40,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:52:41,780][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:52:41,781][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:52:43,183][__main__][INFO] - Iteration 884 took 56s (39.00% Gen, 61.00% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 54m 56s. Estimated total time: 15h 37m 8s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 42s, 500 more iterations: 7h 48m 34s. [2025-08-20 21:52:43,185][__main__][INFO] - Starting iteration 884. 
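The "Saved main optimizer state to … policy_optimizer_state.pt" / "Saved critic optimizer state to … critic_optimizer_state.pt" lines indicate that optimizer state is persisted every iteration so a preempted job can resume without losing momentum/Adam statistics. A minimal sketch assuming the usual `torch.save` of `state_dict()`s (the filenames match the log; the function itself is an assumption):

```python
import torch


def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir, logger):
    """Persist optimizer states each iteration for resumable training (sketch)."""
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)


# resuming later (sketch):
# policy_optimizer.load_state_dict(torch.load(policy_path))
# critic_optimizer.load_state_dict(torch.load(critic_path))
```

At roughly one second per save pair in the timestamps above, this checkpointing is a small fraction of the ~56 s iteration, a reasonable price for restartability on a shared cluster.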
[2025-08-20 21:53:07,448][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:53:07,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:53:07,456][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:53:09,913][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:53:09,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:53:09,921][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:53:09,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:53:09,924][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:53:10,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:11,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:11,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:12,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:13,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:14,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:14,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:15,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:16,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:17,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:18,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:18,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:19,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:20,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:21,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:22,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:53:22,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:23,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:24,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:25,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:26,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:26,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:28,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:29,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:29,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:30,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:31,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:32,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:32,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:33,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:34,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:35,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:53:37,039][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:53:37,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:53:37,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:53:39,374][__main__][INFO] - Iteration 885 took 56s (38.82% Gen, 61.18% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 53m 21s. Estimated total time: 15h 36m 29s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 38s, 500 more iterations: 7h 48m 14s. [2025-08-20 21:53:39,376][__main__][INFO] - Starting iteration 885. 
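The per-iteration summaries ("Iteration 885 took 56s … Estimated remaining time … Time estimates for 10 more iterations: …") all follow from a single statistic, the average iteration duration: at ~56.3 s per iteration, 10 more iterations cost about 9m 23s and 500 about 7h 49m, matching the projections in the log. A sketch of that arithmetic; whether the average is taken over all iterations or a recent window is an assumption:

```python
def format_hms(seconds: float) -> str:
    """Render a duration as e.g. '1h 53m 21s'."""
    s = int(round(seconds))
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"


def eta_report(iteration_times, iterations_done, iterations_total):
    """Project remaining/total time from the mean iteration duration (sketch)."""
    avg = sum(iteration_times) / len(iteration_times)
    remaining = iterations_total - iterations_done
    report = {
        "remaining": format_hms(avg * remaining),
        "total": format_hms(avg * iterations_total),
    }
    for n in (10, 100, 500):
        report[f"{n} more iterations"] = format_hms(avg * n)
    return report


# With avg ≈ 56.3 s: 10 iterations → ~9m 23s, 100 → ~1h 33m 50s,
# 500 → ~7h 49m, in line with the log's estimates.
```

The small jitter in "Estimated remaining time" from one iteration to the next (e.g. 1h 59m → 1h 53m) is what you would expect when the running average is re-fit after each 55–57 s iteration.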
[2025-08-20 21:54:04,108][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:54:04,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:54:04,115][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:54:06,571][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:54:06,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:54:06,579][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:54:06,580][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:54:06,581][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:54:06,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:07,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:08,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:09,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:10,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:10,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:11,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:12,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:13,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:14,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:14,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:15,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:16,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:17,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:18,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:18,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:54:19,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:20,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:21,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:21,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:22,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:23,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:24,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:25,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:25,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:26,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:27,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:28,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:29,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:29,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:30,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:32,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:54:33,735][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:54:34,693][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:54:34,695][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:54:36,054][__main__][INFO] - Iteration 886 took 56s (39.31% Gen, 60.69% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 0m 32s. Estimated total time: 15h 44m 37s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 27s, 500 more iterations: 7h 52m 18s. [2025-08-20 21:54:36,055][__main__][INFO] - Starting iteration 886. 
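Each iteration also logs "Sharing advantage alignment data." and "Receiving advantage packets.": the two adapter trainers exchange advantage estimates so that each policy update can take the opponent's advantages into account, which is the core of advantage-alignment-style opponent shaping. The log does not show the exact objective or the transport for the "packets", so both are assumptions in this heavily hedged illustration of one common form of the coupling:

```python
import torch


def advantage_alignment_weights(own_adv: torch.Tensor,
                                opp_adv: torch.Tensor,
                                beta: float = 1.0) -> torch.Tensor:
    """Illustrative alignment term, not mllm's exact formula.

    own_adv: this agent's advantage estimates for its actions.
    opp_adv: the opponent's advantages received as an "advantage packet"
             (transport between trainers is unspecified in the log).
    The product term upweights actions whose own advantage agrees in sign
    with the opponent's, nudging both policies toward mutually good play.
    """
    return own_adv + beta * own_adv * opp_adv
```

Under this reading, the weights returned here would take the place of the plain `batch["advantages"]` in the accumulation loop sketched earlier, while the rest of the REINFORCE machinery is unchanged.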
[2025-08-20 21:55:00,475][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:00,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:00,483][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:02,936][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:02,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:02,944][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:02,946][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:55:02,946][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:55:03,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:04,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:04,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:05,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:06,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:07,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:08,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:08,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:09,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:10,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:11,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:11,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:12,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:13,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:14,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:15,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:55:15,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:16,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:17,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:18,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:19,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:19,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:20,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:21,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:22,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:23,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:24,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:25,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:25,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:26,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:27,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:28,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:55:29,967][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:55:31,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:55:31,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:55:32,454][__main__][INFO] - Iteration 887 took 56s (38.95% Gen, 61.05% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 54m 57s. Estimated total time: 15h 39m 58s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 59s, 500 more iterations: 7h 49m 59s. [2025-08-20 21:55:32,456][__main__][INFO] - Starting iteration 887. 
[2025-08-20 21:55:56,877][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:56,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:56,885][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:59,342][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:59,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:59,350][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:55:59,352][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:55:59,353][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:55:59,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:00,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:01,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:02,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:02,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:03,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:04,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:05,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:06,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:06,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:07,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:08,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:09,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:09,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:10,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:11,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:56:12,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:13,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:13,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:14,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:15,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:16,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:17,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:17,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:18,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:19,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:20,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:21,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:22,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:23,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:24,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:24,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:26,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:56:27,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:56:27,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:56:28,766][__main__][INFO] - Iteration 888 took 56s (39.02% Gen, 60.98% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 52m 32s. Estimated total time: 15h 38m 29s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 14s. [2025-08-20 21:56:28,768][__main__][INFO] - Starting iteration 888. 
[2025-08-20 21:56:52,933][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:56:52,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:56:52,941][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:56:55,395][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:56:55,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:56:55,403][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:56:55,405][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:56:55,405][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:56:55,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:56,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:57,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:58,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:58,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:56:59,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:00,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:01,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:02,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:02,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:03,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:04,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:05,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:06,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:06,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:07,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:57:08,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:09,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:10,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:10,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:12,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:12,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:13,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:14,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:15,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:16,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:16,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:17,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:18,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:19,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:20,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:20,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:22,502][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:57:23,491][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:57:23,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:57:24,820][__main__][INFO] - Iteration 889 took 56s (38.76% Gen, 61.23% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 47m 18s. Estimated total time: 15h 34m 12s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 25s, 500 more iterations: 7h 47m 6s. [2025-08-20 21:57:24,822][__main__][INFO] - Starting iteration 889. 
[2025-08-20 21:57:49,181][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:57:49,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:57:49,189][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:57:51,645][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:57:51,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:57:51,653][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:57:51,655][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:57:51,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:57:51,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:52,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:53,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:54,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:55,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:55,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:56,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:57,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:58,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:59,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:57:59,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:00,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:01,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:02,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:03,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:03,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:58:04,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:05,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:06,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:07,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:07,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:08,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:09,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:10,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:11,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:12,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:13,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:13,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:14,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:15,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:16,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:17,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:18,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:58:19,688][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:58:19,690][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:58:21,178][__main__][INFO] - Iteration 890 took 56s (38.89% Gen, 61.11% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 51m 25s. Estimated total time: 15h 39m 15s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 55s, 500 more iterations: 7h 49m 37s. [2025-08-20 21:58:21,180][__main__][INFO] - Starting iteration 890. 
[2025-08-20 21:58:45,819][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:58:45,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:58:45,826][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:58:48,310][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:58:48,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:58:48,318][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 21:58:48,320][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 21:58:48,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 21:58:48,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:49,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:50,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:51,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:51,795][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:52,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:53,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:54,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:54,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:55,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:56,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:57,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:58,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:58,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:58:59,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:00,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 21:59:01,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:02,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:02,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:03,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:04,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:05,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:06,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:06,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:07,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:08,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:09,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:10,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:10,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:11,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:13,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:13,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 21:59:15,379][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 21:59:16,440][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 21:59:16,443][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 21:59:17,827][__main__][INFO] - Iteration 891 took 56s (39.14% Gen, 60.86% Train). Generation: 22s, Training: 34s. Estimated remaining time: 1h 55m 20s. Estimated total time: 15h 44m 7s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 24s, 500 more iterations: 7h 52m 3s. [2025-08-20 21:59:17,830][__main__][INFO] - Starting iteration 891. 
[2025-08-20 21:59:42,181][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:59:42,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:59:42,189][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:59:44,625][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:59:44,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:59:44,633][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 21:59:44,635][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 21:59:44,636][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 21:59:44,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:45,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:46,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:47,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:48,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:48,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:49,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:50,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:51,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:52,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:52,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:53,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:54,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:55,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:56,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:56,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:57,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:58,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 21:59:59,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:00,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:00,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:01,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:02,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:03,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:04,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:04,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:05,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:06,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:07,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:08,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:09,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:10,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:11,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:00:12,667][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:00:12,669][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:00:14,490][__main__][INFO] - Iteration 892 took 56s (38.67% Gen, 61.32% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 54m 36s. Estimated total time: 15h 44m 19s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 25s, 500 more iterations: 7h 52m 9s.
[2025-08-20 22:00:14,492][__main__][INFO] - Starting iteration 892.
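Each iteration logs 32 "Accumulated the policy gradient loss for 8960 tokens" records followed by a single "Apply reinforce step", which matches a standard gradient-accumulation loop: each microbatch's REINFORCE loss is normalized by its token count and backpropagated, and the optimizer steps once at the end. A rough sketch under those assumptions; all names here are hypothetical:

    def apply_reinforce_step(policy, optimizer, microbatches, logger):
        # Accumulate the policy-gradient loss over all microbatches, then
        # take one optimizer step (the logged "Apply reinforce step").
        optimizer.zero_grad(set_to_none=True)
        for batch in microbatches:  # e.g. 32 microbatches per iteration
            logprobs = policy.token_logprobs(batch.tokens)         # (B, T)
            loss = -(batch.advantages * logprobs * batch.mask).sum()
            loss = loss / batch.num_tokens                         # per-token scale
            (loss / len(microbatches)).backward()                  # accumulate grads
            logger.info("Accumulated the policy gradient loss for %d tokens.",
                        batch.num_tokens)
        optimizer.step()

The constant 8960-token figure suggests fixed-size, padded microbatches, so the per-token normalization is the same for every accumulation step.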
[2025-08-20 22:00:38,789][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:00:38,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:00:38,797][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:00:41,258][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:00:41,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:00:41,265][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:00:41,268][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:00:41,268][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:00:41,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:42,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:43,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:43,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:44,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:45,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:46,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:47,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:47,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:48,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:49,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:50,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:51,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:51,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:52,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:53,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:54,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:55,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:55,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:56,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:57,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:58,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:00:59,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:00,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:01,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:01,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:02,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:03,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:04,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:05,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:05,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:06,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:08,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:01:09,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:01:09,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:01:10,688][__main__][INFO] - Iteration 893 took 56s (38.87% Gen, 61.13% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 45m 55s. Estimated total time: 15h 36m 35s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 17s.
[2025-08-20 22:01:10,689][__main__][INFO] - Starting iteration 893.
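The paired "Saved main optimizer state" and "Saved critic optimizer state" lines each iteration are consistent with serializing each optimizer's state_dict via torch.save, which is the standard way such .pt checkpoints are produced. A minimal sketch; the directory argument mirrors the agent_trainer paths in the log, and the function name is hypothetical:

    import torch

    def save_optimizer_states(policy_optimizer, critic_optimizer, trainer_dir):
        # Persist optimizer state each iteration so training can resume mid-run.
        torch.save(policy_optimizer.state_dict(),
                   f"{trainer_dir}/policy_optimizer_state.pt")
        torch.save(critic_optimizer.state_dict(),
                   f"{trainer_dir}/critic_optimizer_state.pt")

Overwriting the same two files every iteration keeps only the latest state, which matches the unchanging paths logged here.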
[2025-08-20 22:01:35,106][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:01:35,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:01:35,113][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:01:37,571][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:01:37,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:01:37,579][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:01:37,581][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:01:37,581][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:01:37,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:38,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:39,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:40,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:41,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:41,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:42,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:43,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:44,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:45,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:45,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:46,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:47,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:48,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:49,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:49,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:50,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:51,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:52,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:52,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:53,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:54,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:55,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:56,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:56,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:57,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:58,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:01:59,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:00,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:01,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:02,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:03,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:04,607][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:02:05,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:02:05,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:02:06,889][__main__][INFO] - Iteration 894 took 56s (39.09% Gen, 60.91% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 45m 3s. Estimated total time: 15h 36m 39s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 39s, 500 more iterations: 7h 48m 19s.
[2025-08-20 22:02:06,891][__main__][INFO] - Starting iteration 894.
[2025-08-20 22:02:31,231][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:02:31,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:02:31,239][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:02:33,691][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:02:33,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:02:33,698][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:02:33,701][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:02:33,701][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:02:34,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:34,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:35,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:36,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:37,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:37,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:38,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:39,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:40,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:41,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:41,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:42,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:43,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:44,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:45,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:45,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:46,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:47,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:48,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:49,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:49,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:50,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:51,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:52,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:53,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:53,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:55,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:55,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:56,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:57,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:58,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:02:59,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:00,734][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:03:01,722][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:03:01,723][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:03:03,164][__main__][INFO] - Iteration 895 took 56s (38.91% Gen, 61.08% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 45m 20s. Estimated total time: 15h 37m 52s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 47s, 500 more iterations: 7h 48m 56s.
[2025-08-20 22:03:03,165][__main__][INFO] - Starting iteration 895.
[2025-08-20 22:03:27,809][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:03:27,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:03:27,816][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:03:30,315][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:03:30,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:03:30,322][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:03:30,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:03:30,325][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:03:30,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:31,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:32,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:33,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:33,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:34,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:35,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:36,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:36,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:37,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:38,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:39,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:40,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:40,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:41,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:42,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:43,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:44,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:44,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:45,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:46,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:47,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:48,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:49,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:50,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:50,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:51,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:52,551][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:53,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:54,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:54,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:55,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:03:57,314][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:03:58,355][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:03:58,358][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:03:59,791][__main__][INFO] - Iteration 896 took 56s (39.13% Gen, 60.87% Train). Generation: 22s, Training: 34s. Estimated remaining time: 1h 50m 16s. Estimated total time: 15h 43m 44s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 22s, 500 more iterations: 7h 51m 52s.
[2025-08-20 22:03:59,792][__main__][INFO] - Starting iteration 896.
[2025-08-20 22:04:24,199][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:04:24,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:04:24,206][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:04:26,654][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:04:26,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:04:26,661][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:04:26,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:04:26,664][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:04:26,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:27,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:28,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:29,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:30,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:30,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:31,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:32,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:33,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:34,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:34,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:35,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:36,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:37,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:38,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:38,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:39,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:40,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:41,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:42,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:43,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:43,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:45,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:46,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:47,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:48,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:49,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:50,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:51,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:51,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:52,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:53,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:04:55,052][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:28, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:04:55,989][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:04:55,991][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:04:57,349][__main__][INFO] - Iteration 897 took 57s (38.13% Gen, 61.87% Train). Generation: 21s, Training: 35s. Estimated remaining time: 2h 4m 50s. Estimated total time: 15h 59m 16s. Time estimates for 10 more iterations: 9m 35s, 100 more iterations: 1h 35m 55s, 500 more iterations: 7h 59m 38s.
[2025-08-20 22:04:57,350][__main__][INFO] - Starting iteration 897.
[2025-08-20 22:05:21,616][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:05:21,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:05:21,623][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:05:24,068][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:05:24,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:05:24,076][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:05:24,079][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:05:24,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:05:24,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:25,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:25,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:26,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:27,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:28,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:29,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:29,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:30,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:31,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:32,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:33,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:33,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:34,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:35,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:36,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:37,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:37,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:38,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:39,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:40,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:41,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:41,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:43,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:43,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:44,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:45,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:46,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:47,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:47,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:48,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:49,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:05:51,000][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:05:52,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:05:52,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:05:53,510][__main__][INFO] - Iteration 898 took 56s (38.85% Gen, 61.15% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 40m 36s. Estimated total time: 15h 35m 59s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 35s, 500 more iterations: 7h 47m 59s.
[2025-08-20 22:05:53,511][__main__][INFO] - Starting iteration 898.
[2025-08-20 22:06:17,945][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:06:17,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:06:17,952][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:06:20,425][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:06:20,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:06:20,433][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:06:20,435][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:06:20,436][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:06:20,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:21,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:22,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:23,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:23,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:24,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:25,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:26,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:27,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:27,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:28,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:29,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:30,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:31,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:31,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:32,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:33,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:34,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:35,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:35,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:36,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:37,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:38,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:39,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:39,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:40,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:41,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:42,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:42,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:44,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:45,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:45,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:06:47,518][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:06:48,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:06:48,468][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:06:49,858][__main__][INFO] - Iteration 899 took 56s (39.01% Gen, 60.99% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 42m 47s. Estimated total time: 15h 39m 6s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 54s, 500 more iterations: 7h 49m 33s.
[2025-08-20 22:06:49,860][__main__][INFO] - Starting iteration 899.
[2025-08-20 22:07:14,221][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:07:14,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:07:14,229][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:07:16,701][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:07:16,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:07:16,708][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:07:16,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:07:16,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:07:17,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:17,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:18,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:19,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:20,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:20,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:21,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:22,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:23,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:24,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:24,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:25,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:26,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:27,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:28,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:28,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:29,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:30,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:31,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:32,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:32,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:33,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:34,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:35,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:36,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:37,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:38,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:39,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:39,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:40,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:41,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:42,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:07:43,765][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:07:44,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:07:44,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:07:46,115][__main__][INFO] - Iteration 900 took 56s (38.93% Gen, 61.07% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 40m 19s. Estimated total time: 15h 37m 35s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 45s, 500 more iterations: 7h 48m 47s.
[2025-08-20 22:07:46,117][__main__][INFO] - Starting iteration 900.
[2025-08-20 22:08:10,882][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:08:10,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:08:10,890][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:08:13,338][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:08:13,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:08:13,346][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:08:13,349][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:08:13,349][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:08:13,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:14,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:15,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:16,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:16,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:17,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:18,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:19,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:19,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:20,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:21,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:22,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:23,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:23,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:24,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:25,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:26,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:27,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:27,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:28,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:29,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:30,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:31,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:32,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:33,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:33,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:34,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:35,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:36,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:37,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:37,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:38,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:08:40,254][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 22:08:41,953][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:08:41,956][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:08:45,906][__main__][INFO] - Iteration 901 took 59s (37.33% Gen, 58.41% Train). Generation: 22s, Training: 34s. Estimated remaining time: 2h 38m 14s. Estimated total time: 16h 36m 29s. Time estimates for 10 more iterations: 9m 57s, 100 more iterations: 1h 39m 38s, 500 more iterations: 8h 18m 14s.
[2025-08-20 22:08:45,908][__main__][INFO] - Starting iteration 901.
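Each iteration emits 32 "Accumulated the policy gradient loss for 8960 tokens." records between receiving the advantage packets and the reinforce step: the policy-gradient loss is backpropagated micro-batch by micro-batch (32 × 8,960 ≈ 287k action tokens) before a single optimizer update, which is why ΔVRAM stays near zero. A minimal sketch of that pattern, assuming a token-level REINFORCE loss with precomputed (advantage-aligned) advantages; the batch fields and model interface are hypothetical:

```python
import logging

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def accumulate_policy_gradient(model, micro_batches, optimizer):
    # Gradients are summed micro-batch by micro-batch so the full batch
    # never has to fit in VRAM at once.
    optimizer.zero_grad(set_to_none=True)
    n = len(micro_batches)  # 32 in this run
    for batch in micro_batches:
        logits = model(batch["input_ids"]).logits
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, batch["labels"].unsqueeze(-1)).squeeze(-1)
        mask = batch["action_mask"]  # 1 on the 8960 trained action tokens
        loss = -(token_logp * batch["advantages"] * mask).sum() / mask.sum()
        (loss / n).backward()
        logger.info(
            "Accumulated the policy gradient loss for %d tokens.",
            int(mask.sum()),
        )
    # the "Apply reinforce step" record then corresponds to optimizer.step()
    optimizer.step()
```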
[2025-08-20 22:09:10,329][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:09:10,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:09:10,337][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:09:12,770][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:09:12,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:09:12,778][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:09:12,780][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:09:12,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:09:13,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:13,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:14,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:15,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:16,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:17,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:17,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:18,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:19,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:20,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:21,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:21,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:22,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:23,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:24,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:24,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:25,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:26,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:27,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:28,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:28,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:29,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:30,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:31,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:32,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:32,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:33,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:35,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:35,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:36,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:37,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:38,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:09:39,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 22:09:40,702][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:09:40,704][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:09:42,189][__main__][INFO] - Iteration 902 took 56s (39.07% Gen, 60.92% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 38m 49s. Estimated total time: 15h 38m 0s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 48s, 500 more iterations: 7h 49m 0s.
[2025-08-20 22:09:42,190][__main__][INFO] - Starting iteration 902.
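After every reinforce step both optimizer states are rewritten to the run directory, so a preempted job can resume with intact Adam/momentum statistics. A minimal sketch, assuming plain `torch.save` of the optimizers' `state_dict`s; the helper name is hypothetical, the file names are the ones in the log:

```python
import logging
from pathlib import Path

import torch

logger = logging.getLogger("mllm.training.trainer_common")

def save_optimizer_states(policy_optimizer, critic_optimizer, trainer_dir):
    # Persist both optimizer states after each update.
    trainer_dir = Path(trainer_dir)
    policy_path = trainer_dir / "policy_optimizer_state.pt"
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)
    critic_path = trainer_dir / "critic_optimizer_state.pt"
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)
```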
[2025-08-20 22:10:06,411][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:10:06,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:10:06,419][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:10:08,867][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:10:08,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:10:08,874][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:10:08,876][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:10:08,877][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:10:09,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:09,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:10,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:11,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:12,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:13,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:13,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:14,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:15,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:16,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:17,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:17,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:18,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:19,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:20,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:21,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:21,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:22,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:23,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:24,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:25,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:25,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:26,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:27,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:28,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:29,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:29,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:31,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:31,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:32,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:33,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:34,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:10:35,956][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 22:10:37,261][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:10:37,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:10:38,525][__main__][INFO] - Iteration 903 took 56s (38.68% Gen, 61.32% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 38m 46s. Estimated total time: 15h 38m 54s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 53s, 500 more iterations: 7h 49m 27s.
[2025-08-20 22:10:38,526][__main__][INFO] - Starting iteration 903.
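The per-iteration summary line splits wall time into generation and training and extrapolates the remaining time; the way the estimate jumps between consecutive iterations (1h 40m, then 2h 38m, then back to 1h 38m) suggests it extrapolates from recent iterations rather than a long-run average. A minimal sketch of such a tracker, assuming a fixed iteration budget (the run appears to target roughly 1000 iterations given ~56 s/iteration and ~1h 40m remaining at iteration 900); all names are hypothetical:

```python
from collections import deque

def fmt_hms(seconds):
    s = int(seconds)
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

class IterationTimer:
    # Tracks recent iteration durations and extrapolates remaining time.
    def __init__(self, total_iterations, window=10):
        self.total = total_iterations
        self.durations = deque(maxlen=window)

    def report(self, iteration, gen_s, train_s):
        dur = gen_s + train_s
        self.durations.append(dur)
        avg = sum(self.durations) / len(self.durations)
        left = self.total - (iteration + 1)
        return (
            f"Iteration {iteration + 1} took {int(dur)}s "
            f"({100 * gen_s / dur:.2f}% Gen, {100 * train_s / dur:.2f}% Train). "
            f"Generation: {int(gen_s)}s, Training: {int(train_s)}s. "
            f"Estimated remaining time: {fmt_hms(left * avg)}. "
            f"Time estimates for 10 more iterations: {fmt_hms(10 * avg)}, "
            f"100 more iterations: {fmt_hms(100 * avg)}, "
            f"500 more iterations: {fmt_hms(500 * avg)}."
        )
```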
[2025-08-20 22:11:02,894][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:11:02,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:11:02,902][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:11:05,354][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:11:05,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:11:05,361][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:11:05,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:11:05,364][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:11:05,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:06,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:07,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:08,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:08,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:09,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:10,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:11,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:12,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:12,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:13,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:14,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:15,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:15,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:16,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:17,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:18,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:19,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:19,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:20,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:21,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:22,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:23,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:23,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:24,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:25,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:26,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:27,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:27,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:29,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:29,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:30,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:11:32,382][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 22:11:33,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:11:33,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:11:34,706][__main__][INFO] - Iteration 904 took 56s (38.99% Gen, 61.00% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 35m 15s. Estimated total time: 15h 36m 19s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 37s, 500 more iterations: 7h 48m 9s.
[2025-08-20 22:11:34,707][__main__][INFO] - Starting iteration 904.
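The "Sharing advantage alignment data." / "Receiving advantage packets." pair marks the point where each agent's trainer exchanges its advantage estimates with its opponent's, which is what lets advantage alignment weight an agent's own advantages by the other player's. The log does not show the transport; a purely illustrative sketch using queue-style endpoints, with all names hypothetical:

```python
from dataclasses import dataclass

import torch

@dataclass
class AdvantagePacket:
    # Hypothetical payload: per-token advantages for one batch of trajectories.
    trajectory_ids: list
    advantages: torch.Tensor

def exchange_advantage_packets(own, send_queue, recv_queue):
    # Share our advantage alignment data with the opponent's trainer ...
    send_queue.put(own)
    # ... and block until the opponent's advantage packets arrive.
    return recv_queue.get()
```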
[2025-08-20 22:11:59,067][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:11:59,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:11:59,074][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:01,514][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:01,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:01,521][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:01,524][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:12:01,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:12:01,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:02,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:03,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:04,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:04,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:05,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:06,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:07,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:08,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:08,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:09,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:10,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:11,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:12,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:12,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:13,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:14,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:15,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:16,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:16,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:17,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:18,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:19,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:20,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:21,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:22,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:22,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:23,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:24,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:25,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:26,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:26,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:28,454][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 22:12:29,606][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:12:29,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:12:31,064][__main__][INFO] - Iteration 905 took 56s (38.90% Gen, 61.10% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 37m 16s. Estimated total time: 15h 39m 16s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 55s, 500 more iterations: 7h 49m 38s.
[2025-08-20 22:12:31,066][__main__][INFO] - Starting iteration 905.
[2025-08-20 22:12:56,276][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:56,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:56,284][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:58,741][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:58,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:58,749][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:12:58,751][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:12:58,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:12:59,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:12:59,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:00,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:01,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:02,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:03,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:03,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:04,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:05,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:06,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:06,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:07,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:08,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:09,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:10,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:10,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:11,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:12,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:13,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:14,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:14,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:15,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:16,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:17,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:18,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:18,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:20,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:21,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:21,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:22,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:23,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:24,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:25,821][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 22:13:26,798][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:13:26,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:13:28,116][__main__][INFO] - Iteration 906 took 57s (39.86% Gen, 60.14% Train). Generation: 22s, Training: 34s. Estimated remaining time: 1h 47m 52s. Estimated total time: 15h 50m 49s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 4s, 500 more iterations: 7h 55m 24s.
[2025-08-20 22:13:28,117][__main__][INFO] - Starting iteration 906.
[2025-08-20 22:13:52,858][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:13:52,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:13:52,865][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:13:55,326][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:13:55,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:13:55,334][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:13:55,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:13:55,336][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:13:55,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:56,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:57,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:58,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:58,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:13:59,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:00,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:01,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:01,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:02,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:03,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:04,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:05,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:05,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:06,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:07,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:08,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:09,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:09,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:10,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:11,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:12,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:13,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:13,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:14,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:15,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:16,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:17,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:18,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:19,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:19,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:20,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:22,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 22:14:23,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:14:23,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:14:24,764][__main__][INFO] - Iteration 907 took 56s (39.35% Gen, 60.65% Train). Generation: 22s, Training: 34s. Estimated remaining time: 1h 40m 12s. Estimated total time: 15h 44m 6s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 24s, 500 more iterations: 7h 52m 3s.
[2025-08-20 22:14:24,766][__main__][INFO] - Starting iteration 907.
[2025-08-20 22:14:49,073][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:14:49,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:14:49,081][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:14:51,516][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:14:51,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:14:51,524][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:14:51,526][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:14:51,527][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:14:51,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:52,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:53,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:54,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:54,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:55,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:56,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:57,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:58,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:58,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:14:59,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:00,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:01,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:02,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:02,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:03,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:04,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:05,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:06,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:06,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:07,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:08,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:09,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:10,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:10,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:11,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:12,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:13,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:14,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:14,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:15,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:16,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:18,078][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 22:15:19,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:15:19,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:15:20,831][__main__][INFO] - Iteration 908 took 56s (39.00% Gen, 61.00% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 29m 35s. Estimated total time: 15h 34m 24s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 26s, 500 more iterations: 7h 47m 12s.
[2025-08-20 22:15:20,832][__main__][INFO] - Starting iteration 908.
[2025-08-20 22:15:45,166][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:15:45,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:15:45,174][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:15:47,678][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:15:47,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:15:47,685][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 22:15:47,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:15:47,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:15:47,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:48,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:49,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:50,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:51,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:51,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:52,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:53,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:54,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:55,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:55,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:56,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:57,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:58,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:59,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:15:59,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:00,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:01,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:02,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:03,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:03,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:04,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:05,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:06,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:07,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:07,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:08,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:09,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:10,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:11,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:12,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:13,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:16:14,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 22:16:15,867][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:16:15,869][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:16:17,264][__main__][INFO] - Iteration 909 took 56s (38.71% Gen, 61.29% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 34m 44s. Estimated total time: 15h 40m 30s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 3s, 500 more iterations: 7h 50m 15s.
[2025-08-20 22:16:17,265][__main__][INFO] - Starting iteration 909.
[2025-08-20 22:16:41,475][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:16:41,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:16:41,483][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:16:43,935][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:16:43,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:16:43,943][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:16:43,945][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 22:16:43,946][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 22:16:44,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:45,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:45,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:46,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:47,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:48,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:49,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:49,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:50,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:51,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:52,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:52,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:53,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:54,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:55,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:56,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 22:16:56,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:57,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:58,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:16:59,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:00,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:00,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:01,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:02,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:03,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:04,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:05,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:06,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:06,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:07,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:08,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:09,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:10,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 22:17:11,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 22:17:11,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 22:17:13,308][__main__][INFO] - Iteration 910 took 56s (38.83% Gen, 61.16% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 27m 20s. Estimated total time: 15h 34m 2s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 24s, 500 more iterations: 7h 47m 1s. [2025-08-20 22:17:13,310][__main__][INFO] - Starting iteration 910. 
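Annotation: each iteration accumulates the policy-gradient loss over 32 micro-batches of 8,960 tokens (286,720 tokens per update, 16 "Accumulated …" entries in each half of the block) before the single "Apply reinforce step". A sketch of that accumulate-then-step pattern for a token-level REINFORCE objective weighted by advantages follows; the batch fields, the loss normalization, and the gradient clipping are assumptions, not mllm's exact objective.

```python
import torch

def reinforce_update(model, optimizer, micro_batches, grad_accum_steps=32):
    """Accumulate a REINFORCE-style policy-gradient loss, then take one optimizer step."""
    optimizer.zero_grad(set_to_none=True)
    for batch in micro_batches:  # each batch: input_ids, action_mask, advantages
        logits = model(batch["input_ids"]).logits           # HF-style causal LM assumed
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
        token_logprobs = logprobs.gather(
            -1, batch["input_ids"][:, 1:].unsqueeze(-1)
        ).squeeze(-1)
        mask = batch["action_mask"][:, 1:]                  # 1 on tokens the agent generated
        # policy-gradient surrogate: -A * log pi(a|s), averaged over action tokens
        loss = -(batch["advantages"][:, 1:] * token_logprobs * mask).sum() / mask.sum()
        (loss / grad_accum_steps).backward()                # average gradients over micro-batches
        print(f"Accumulated the policy gradient loss for {batch['input_ids'].numel()} tokens.")
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clipping value is an assumption
    optimizer.step()
```

Fixed-shape micro-batches would explain why every accumulation line reports the same 8,960-token count.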
[2025-08-20 22:17:37,442][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:17:37,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:17:37,450][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:17:39,917][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:17:39,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:17:39,925][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:17:39,927][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 22:17:39,928][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 22:17:40,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:41,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:41,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:42,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:43,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:44,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:44,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:45,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:46,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:47,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:48,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:48,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:49,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:50,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:51,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:52,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 22:17:52,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:53,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:54,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:55,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:56,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:57,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:58,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:58,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:17:59,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:00,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:01,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:02,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:02,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:03,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:04,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:05,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:06,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 22:18:07,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 22:18:07,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 22:18:09,268][__main__][INFO] - Iteration 911 took 55s (38.74% Gen, 61.25% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 24m 59s. Estimated total time: 15h 32m 37s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 15s, 500 more iterations: 7h 46m 18s. [2025-08-20 22:18:09,269][__main__][INFO] - Starting iteration 911. 
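Annotation: "Sharing advantage alignment data" / "Receiving advantage packets" marks the point where the two negotiating agents swap per-step advantage estimates, so that each policy gradient can also condition on the opponent's advantages (the "alignment" in advantage alignment). One plausible exchange over torch.distributed is sketched below; the fixed send/recv ordering and the pre-agreed tensor shape are illustrative assumptions, not the actual mllm wire format.

```python
import torch
import torch.distributed as dist

def exchange_advantages(my_adv: torch.Tensor, peer_rank: int) -> torch.Tensor:
    """Swap advantage tensors with the other agent's trainer process.

    Assumes an initialized process group and a pre-agreed tensor shape;
    a real packet would also carry trajectory and turn identifiers.
    """
    their_adv = torch.empty_like(my_adv)
    if dist.get_rank() < peer_rank:   # fixed ordering prevents a send/recv deadlock
        dist.send(my_adv, dst=peer_rank)
        dist.recv(their_adv, src=peer_rank)
    else:
        dist.recv(their_adv, src=peer_rank)
        dist.send(my_adv, dst=peer_rank)
    return their_adv
```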
[2025-08-20 22:18:33,556][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:18:33,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:18:33,564][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:18:36,024][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:18:36,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:18:36,032][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:18:36,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 22:18:36,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 22:18:36,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:37,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:37,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:38,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:39,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:40,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:41,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:41,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:42,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:43,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:44,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:45,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:45,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:46,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:47,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:48,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 22:18:49,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:49,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:50,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:51,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:52,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:53,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:53,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:54,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:56,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:56,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:57,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:58,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:18:59,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:00,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:00,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:01,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:03,221][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 22:19:04,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 22:19:04,175][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 22:19:05,493][__main__][INFO] - Iteration 912 took 56s (38.82% Gen, 61.18% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 28m 29s. Estimated total time: 15h 37m 3s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 42s, 500 more iterations: 7h 48m 31s. [2025-08-20 22:19:05,495][__main__][INFO] - Starting iteration 912. 
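Annotation: the two "Saved … optimizer state" lines persist the policy and critic optimizer states every iteration, so a preempted run can resume without losing Adam moments or step counters. The standard PyTorch pattern looks like this; the wrapper names are illustrative, while the file names mirror the log.

```python
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    # mirrors the two per-iteration saves in the log
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

def load_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    # on resume, restores the moments and step counters saved above
    policy_opt.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
    critic_opt.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))
```

Per the timestamps, both saves complete in about a second, which keeps per-iteration checkpointing affordable.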
[2025-08-20 22:19:29,594][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:19:29,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:19:29,602][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:19:32,047][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:19:32,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:19:32,054][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:19:32,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 22:19:32,057][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 22:19:32,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:33,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:33,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:34,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:35,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:36,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:37,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:37,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:38,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:39,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:40,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:41,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:41,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:42,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:43,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:44,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 22:19:45,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:45,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:46,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:47,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:48,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:49,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:49,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:50,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:51,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:52,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:53,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:54,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:55,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:55,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:56,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:57,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:19:59,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 22:20:00,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 22:20:00,146][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 22:20:01,637][__main__][INFO] - Iteration 913 took 56s (38.57% Gen, 61.43% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 26m 11s. Estimated total time: 15h 35m 42s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 34s, 500 more iterations: 7h 47m 51s. [2025-08-20 22:20:01,638][__main__][INFO] - Starting iteration 913. 
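Annotation: the closing summary extrapolates from an average iteration duration. The figures are internally consistent: at iteration 913, "10 more iterations: 9m 21s" and "100 more iterations: 1h 33m 34s" both imply roughly 56.1 s per iteration. (Note also that the completed-iteration counter runs one ahead of the "Starting iteration" counter: the iteration started as 912 is reported on completion as "Iteration 913 took …".) A sketch of the bookkeeping, with all names invented:

```python
def hms(seconds: float) -> str:
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else (f"{m}m {s}s" if m else f"{s}s")

def projections(avg_iter_s: float, iters_left: int):
    """Remaining-time style estimates from a running average iteration duration."""
    eta = hms(avg_iter_s * iters_left)
    extras = {n: hms(avg_iter_s * n) for n in (10, 100, 500)}
    return eta, extras

# with the ~56.1 s average implied above:
# projections(56.1, 92) -> ("1h 26m 1s", {10: "9m 21s", 100: "1h 33m 30s", 500: "7h 47m 30s"})
```

That the remaining-time estimate drifts by a few minutes between adjacent iterations suggests the average is computed over a recent window rather than the whole run.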
[2025-08-20 22:20:24,975][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:20:24,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:20:24,982][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:20:27,430][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:20:27,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:20:27,438][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:20:27,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 22:20:27,441][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 22:20:27,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:28,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:29,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:30,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:30,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:31,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:32,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:33,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:34,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:34,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:35,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:36,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:37,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:38,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:38,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:39,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 22:20:40,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:41,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:42,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:42,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:44,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:44,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:45,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:46,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:47,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:48,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:48,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:49,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:50,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:51,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:52,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:52,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:20:54,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 22:20:55,498][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 22:20:55,501][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 22:20:57,068][__main__][INFO] - Iteration 914 took 55s (37.69% Gen, 62.31% Train). Generation: 20s, Training: 34s. Estimated remaining time: 1h 13m 23s. Estimated total time: 15h 23m 49s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 54s. [2025-08-20 22:20:57,070][__main__][INFO] - Starting iteration 914. 
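Annotation: the recurring "ΔVRAM Allocated: 2.288818359375e-05 GB" on each reinforce step is an exact binary-unit artifact: it is 24,576 bytes (24 KiB) divided by 2**30, so the log's "GB" figures are byte counts divided by 2**30 (GiB). The conversion:

```python
GIB = 2**30  # the log's "GB" figures divide raw byte counts by 2**30 (i.e., GiB)

def to_gib(n_bytes: int) -> float:
    return n_bytes / GIB

assert to_gib(24576) == 2.288818359375e-05  # the constant Δ seen at every reinforce step
```

So each step permanently retains only 24 KiB of new allocator memory, a few small bookkeeping tensors, while the reserved pool stays flat.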
[2025-08-20 22:21:21,086][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:21:21,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:21:21,093][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:21:23,560][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:21:23,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:21:23,567][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:21:23,570][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 22:21:23,570][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 22:21:23,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:24,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:25,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:26,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:27,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:27,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:28,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:29,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:30,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:31,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:31,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:32,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:33,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:34,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:34,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:35,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 22:21:36,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:37,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:38,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:38,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:39,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:40,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:41,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:42,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:42,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:43,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:45,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:45,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:46,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:47,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:48,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:49,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:21:50,682][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 22:21:51,626][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 22:21:51,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 22:21:52,970][__main__][INFO] - Iteration 915 took 55s (38.57% Gen, 61.42% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 20m 17s. Estimated total time: 15h 31m 39s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 9s, 500 more iterations: 7h 45m 49s. [2025-08-20 22:21:52,971][__main__][INFO] - Starting iteration 915. 
[2025-08-20 22:22:16,549][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:22:16,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:22:16,556][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:22:18,998][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:22:19,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:22:19,007][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:22:19,009][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 22:22:19,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 22:22:19,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:20,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:20,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:21,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:22,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:23,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:24,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:24,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:25,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:26,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:27,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:28,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:28,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:29,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:30,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:31,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 22:22:32,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:32,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:34,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:34,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:35,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:36,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:37,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:38,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:38,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:39,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:40,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:41,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:41,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:42,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:43,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:44,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:22:46,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 22:22:46,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 22:22:47,000][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 22:22:48,515][__main__][INFO] - Iteration 916 took 55s (38.06% Gen, 61.94% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 13m 26s. Estimated total time: 15h 25m 43s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 34s, 500 more iterations: 7h 42m 51s. [2025-08-20 22:22:48,517][__main__][INFO] - Starting iteration 916. 
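Annotation: taken together, the messages in each block trace a fixed per-iteration pipeline. The trajectory/advantage triplet runs twice, about 2.5 s apart, consistent with one pass per trajectory batch; the agents then exchange advantage packets, the policy gradient is accumulated over 32 micro-batches, and one reinforce step plus an optimizer-state checkpoint closes the iteration. The runnable skeleton below maps each log message to its position in that pipeline; every method is a stub with an invented name, standing in for mllm internals.

```python
class StubTrainer:
    """Stand-in for the mllm trainer: each stub marks where a log message is emitted."""

    def create_alternative_trajectory_batch(self, batch):
        return batch                   # "Create alternative trajectory batch"

    def get_advantages_with_critic_grad_accum(self, batch):
        return [0.0] * len(batch)      # "Get advantages with critic gradient accumulation"

    def compute_alternative_advantage_estimates(self, alt_batch):
        return [0.0] * len(alt_batch)  # "Compute alternative advantage estimates"

    def exchange_advantage_packets(self, advantages):
        return advantages              # "Sharing ..." / "Receiving advantage packets"

    def accumulate_policy_gradient(self, advantages, their_advantages):
        pass                           # 32 x "Accumulated the policy gradient loss ..."

    def apply_reinforce_step(self):
        pass                           # "Apply reinforce step" + optimizer-state saves


def run_iteration(trainer: StubTrainer, trajectory_batches):
    all_adv = []
    for batch in trajectory_batches:   # the triplet logs once per batch, hence twice
        alt = trainer.create_alternative_trajectory_batch(batch)
        all_adv.append(trainer.get_advantages_with_critic_grad_accum(batch))
        trainer.compute_alternative_advantage_estimates(alt)
    theirs = trainer.exchange_advantage_packets(all_adv)
    trainer.accumulate_policy_gradient(all_adv, theirs)
    trainer.apply_reinforce_step()

run_iteration(StubTrainer(), [[1, 2, 3], [4, 5, 6]])
```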
[2025-08-20 22:23:12,583][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:23:12,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:23:12,591][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:23:15,058][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:23:15,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:23:15,066][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:23:15,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 22:23:15,069][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 22:23:15,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:16,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:16,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:17,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:18,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:19,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:20,124][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:20,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:21,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:22,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:23,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:24,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:24,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:25,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:26,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:27,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 22:23:28,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:28,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:29,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:30,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:31,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:32,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:33,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:34,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:34,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:35,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:36,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:37,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:38,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:38,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:39,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:40,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:23:42,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 22:23:43,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 22:23:43,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 22:23:44,433][__main__][INFO] - Iteration 917 took 55s (38.61% Gen, 61.39% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 18m 42s. Estimated total time: 15h 31m 55s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 11s, 500 more iterations: 7h 45m 57s. [2025-08-20 22:23:44,434][__main__][INFO] - Starting iteration 917. 
[2025-08-20 22:24:08,112][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:24:08,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:24:08,119][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:24:10,564][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:24:10,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:24:10,571][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 22:24:10,574][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 22:24:10,574][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 22:24:10,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:11,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:12,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:13,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:14,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:14,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:15,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:16,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:17,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:18,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:18,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:19,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:20,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:21,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:21,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:22,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 22:24:23,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:24,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:25,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:25,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:27,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:28,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:28,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:29,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:30,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:31,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:32,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:32,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:33,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:34,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:35,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:36,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 22:24:37,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 22:24:38,678][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 22:24:38,680][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 22:24:40,219][__main__][INFO] - Iteration 918 took 55s (38.06% Gen, 61.94% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 15m 34s. Estimated total time: 15h 29m 44s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 58s, 500 more iterations: 7h 44m 52s. [2025-08-20 22:24:40,220][__main__][INFO] - Starting iteration 918. 
[2025-08-20 22:25:02,954][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:25:02,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:25:02,962][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:25:05,432–22:25:05,439][mllm.training.trainer_ad_align][INFO] - (the same three task records repeat with identical counters)
[2025-08-20 22:25:05,442][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:25:05,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:25:05,740–22:25:30,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (record logged 32×, ~0.8 s apart)
[2025-08-20 22:25:32,576][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.29e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.9158%
[2025-08-20 22:25:33,508][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:25:33,510][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:25:34,824][__main__][INFO] - Iteration 919 took 54s (37.14% Gen, 62.85% Train). Generation: 20s, Training: 34s. Estimated remaining time: 54m 59s. Estimated total time: 15h 10m 3s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 0s, 500 more iterations: 7h 35m 1s.
[2025-08-20 22:25:34,826][__main__][INFO] - Starting iteration 919.
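The per-iteration ETA lines are running-average arithmetic: mean iteration duration so far times the number of remaining iterations. Iteration 919's "9m 6s" for 10 more iterations implies a mean of about 54.6 s, which also reproduces its 100-iteration figure (54.6 × 100 ≈ 5460 s ≈ 1h 31m 0s) and its 500-iteration figure (≈ 7h 35m). A sketch of the estimator (names are illustrative):

def eta_seconds(mean_iter_s: float, remaining: int) -> float:
    """Project remaining wall-clock time from the mean iteration duration."""
    return mean_iter_s * remaining

# e.g. the ≈54.6 s mean implied by iteration 919's estimates:
# eta_seconds(54.6, 100) ≈ 5460 s ≈ 1h 31m, matching the log line above.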
[2025-08-20 22:25:58,053][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:25:58,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:25:58,061][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:26:00,519–22:26:00,527][mllm.training.trainer_ad_align][INFO] - (the same three task records repeat with identical counters)
[2025-08-20 22:26:00,529][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:26:00,530][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:26:00,829–22:26:25,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (record logged 32×, ~0.8 s apart)
[2025-08-20 22:26:27,558][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.29e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.9158%
[2025-08-20 22:26:28,515][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:26:28,516][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:26:29,883][__main__][INFO] - Iteration 920 took 55s (37.74% Gen, 62.26% Train). Generation: 20s, Training: 34s. Estimated remaining time: 1h 1m 38s. Estimated total time: 15h 17m 36s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 48s.
[2025-08-20 22:26:29,884][__main__][INFO] - Starting iteration 920.
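Every "For task: …" record reports allocated and reserved VRAM deltas plus elapsed time around a named block of work; the deltas here are near zero because all buffers were sized in earlier iterations. A minimal sketch of such instrumentation with PyTorch's CUDA memory counters follows; the wrapper name and message format are assumptions, not the project's actual code:

import time
import torch

def log_task(task_name, fn, *args, **kwargs):
    """Run fn(...) and log VRAM/time deltas like the 'For task:' records."""
    gib = 1024 ** 3
    alloc0 = torch.cuda.memory_allocated()
    reserved0 = torch.cuda.memory_reserved()
    t0 = time.monotonic()
    result = fn(*args, **kwargs)
    d_alloc = (torch.cuda.memory_allocated() - alloc0) / gib
    d_reserved = (torch.cuda.memory_reserved() - reserved0) / gib
    total = torch.cuda.get_device_properties(0).total_memory
    pct = 100.0 * torch.cuda.memory_reserved() / total
    elapsed = int(time.monotonic() - t0)
    print(f"For task: {task_name}, ΔVRAM Allocated: {d_alloc} GB, "
          f"ΔVRAM Reserved: {d_reserved} GB, ΔTime: {elapsed}s, "
          f"Percentage of VRAM taken: {pct:.4f}%")
    return result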
[2025-08-20 22:26:53,036][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:26:53,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:26:53,043][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:26:55,484–22:26:55,491][mllm.training.trainer_ad_align][INFO] - (the same three task records repeat with identical counters)
[2025-08-20 22:26:55,494][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:26:55,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:26:55,792–22:27:20,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (record logged 32×, ~0.8 s apart)
[2025-08-20 22:27:22,576][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.29e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.9158%
[2025-08-20 22:27:23,540][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:27:23,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:27:24,816][__main__][INFO] - Iteration 921 took 54s (37.72% Gen, 62.28% Train). Generation: 20s, Training: 34s. Estimated remaining time: 58m 37s. Estimated total time: 15h 15m 31s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 33s, 500 more iterations: 7h 37m 45s.
[2025-08-20 22:27:24,818][__main__][INFO] - Starting iteration 921.
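The two "Saved … optimizer state" records at the end of every iteration persist the policy and critic optimizer states so a preempted run can resume without losing optimizer moments. The pattern is a plain torch.save of a state_dict; the function and parameter names below are assumptions:

import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, out_dir: str) -> None:
    """Persist optimizer states each iteration for preemption recovery."""
    torch.save(policy_optimizer.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_optimizer.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

def load_optimizer_states(policy_optimizer, critic_optimizer, out_dir: str) -> None:
    """Restore on resume; map_location='cpu' avoids device mismatches."""
    policy_optimizer.load_state_dict(
        torch.load(f"{out_dir}/policy_optimizer_state.pt", map_location="cpu"))
    critic_optimizer.load_state_dict(
        torch.load(f"{out_dir}/critic_optimizer_state.pt", map_location="cpu"))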
[2025-08-20 22:27:47,953][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:27:47,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:27:47,960][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:27:50,414–22:27:50,422][mllm.training.trainer_ad_align][INFO] - (the same three task records repeat with identical counters)
[2025-08-20 22:27:50,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:27:50,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:27:50,723–22:28:15,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (record logged 32×, ~0.8 s apart)
[2025-08-20 22:28:17,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.29e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.9158%
[2025-08-20 22:28:18,453][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:28:18,455][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:28:19,796][__main__][INFO] - Iteration 922 took 54s (37.63% Gen, 62.37% Train). Generation: 20s, Training: 34s. Estimated remaining time: 58m 29s. Estimated total time: 15h 16m 18s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 37s, 500 more iterations: 7h 38m 9s.
[2025-08-20 22:28:19,798][__main__][INFO] - Starting iteration 922.
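The "Sharing advantage alignment data." and "Receiving advantage packets." pair shows the two agents' trainers exchanging advantage estimates each iteration before the policy-gradient loss is formed, which is how advantage-alignment methods couple one agent's update to the other's returns. The log does not reveal the exact objective; the sketch below is a heavily simplified, hypothetical form in which the opponent's advantages reweight the agent's own advantage-weighted log-probabilities (send/recv, beta, and all tensor names are assumptions):

import torch

def ad_align_weights(own_adv: torch.Tensor,
                     opp_adv: torch.Tensor,
                     beta: float = 1.0) -> torch.Tensor:
    """Hypothetical alignment weighting: add an opponent-coupling term
    (past own advantages times the opponent's current advantage) to the
    usual advantage weights."""
    past_own = own_adv.cumsum(dim=-1) - own_adv   # sum of own A before step t
    return own_adv + beta * past_own * opp_adv

# Schematic per-iteration exchange, mirroring the two log records:
# send(peer, own_adv)                  # "Sharing advantage alignment data."
# opp_adv = recv(peer)                 # "Receiving advantage packets."
# weights = ad_align_weights(own_adv, opp_adv)
# loss = -(weights.detach() * token_logp).sum() / n_tokens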
[2025-08-20 22:28:42,672][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:28:42,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:28:42,679][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:28:45,134–22:28:45,142][mllm.training.trainer_ad_align][INFO] - (the same three task records repeat with identical counters)
[2025-08-20 22:28:45,144][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:28:45,145][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:28:45,443–22:29:10,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (record logged 32×, ~0.8 s apart)
[2025-08-20 22:29:12,180][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.29e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.9158%
[2025-08-20 22:29:13,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:29:13,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:29:14,503][__main__][INFO] - Iteration 923 took 54s (37.34% Gen, 62.66% Train). Generation: 20s, Training: 34s. Estimated remaining time: 53m 1s. Estimated total time: 15h 11m 45s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 10s, 500 more iterations: 7h 35m 52s.
[2025-08-20 22:29:14,505][__main__][INFO] - Starting iteration 923.
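The task names "Get advantages with critic gradient accumulation" and "Compute alternative advantage estimates" suggest advantages come from a learned critic whose regression loss is accumulated in the same pass, with a second, alternative trajectory batch scored the same way for the alignment term. A rough sketch under those assumptions; the critic, the discount, and every helper below are hypothetical:

import torch

def discounted_returns(rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Backward scan: R_t = r_t + gamma * R_{t+1}."""
    out = torch.zeros_like(rewards)
    acc = torch.zeros_like(rewards[..., 0])
    for t in range(rewards.shape[-1] - 1, -1, -1):
        acc = rewards[..., t] + gamma * acc
        out[..., t] = acc
    return out

def advantages_with_critic_accumulation(critic, batch, gamma: float = 0.99):
    """Estimate A_t = R_t - V(s_t) while accumulating the critic's loss.

    Calling backward() here (without an optimizer step) accumulates critic
    gradients across batches, matching the task name in the log."""
    values = critic(batch["obs"])                  # V(s_t), shape (..., T)
    returns = discounted_returns(batch["rewards"], gamma)
    ((returns - values) ** 2).mean().backward()    # critic regression loss
    return (returns - values).detach()             # advantage estimates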
[2025-08-20 22:29:37,570][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:29:37,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:29:37,578][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:29:40,048–22:29:40,057][mllm.training.trainer_ad_align][INFO] - (the same three task records repeat with identical counters)
[2025-08-20 22:29:40,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:29:40,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:29:40,357–22:30:05,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (record logged 32×, ~0.8 s apart)
[2025-08-20 22:30:07,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.29e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.9158%
[2025-08-20 22:30:08,126][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:30:08,128][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:30:09,516][__main__][INFO] - Iteration 924 took 55s (37.44% Gen, 62.56% Train). Generation: 20s, Training: 34s. Estimated remaining time: 57m 12s. Estimated total time: 15h 16m 51s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 41s, 500 more iterations: 7h 38m 25s.
[2025-08-20 22:30:09,518][__main__][INFO] - Starting iteration 924.
[2025-08-20 22:30:32,609][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:30:32,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:30:32,617][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:30:35,070–22:30:35,078][mllm.training.trainer_ad_align][INFO] - (the same three task records repeat with identical counters)
[2025-08-20 22:30:35,080][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:30:35,081][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:30:35,379–22:31:00,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (record logged 32×, ~0.8 s apart)
[2025-08-20 22:31:02,201][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.29e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.9158%
[2025-08-20 22:31:03,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:31:03,134][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:31:04,486][__main__][INFO] - Iteration 925 took 54s (37.53% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 55m 33s. Estimated total time: 15h 16m 7s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 3s.
[2025-08-20 22:31:04,487][__main__][INFO] - Starting iteration 925.
[2025-08-20 22:31:27,384][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:31:27,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:31:27,391][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:31:29,844–22:31:29,852][mllm.training.trainer_ad_align][INFO] - (the same three task records repeat with identical counters)
[2025-08-20 22:31:29,854][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:31:29,855][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:31:30,154–22:31:55,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (record logged 32×, ~0.8 s apart)
[2025-08-20 22:31:56,942][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.29e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.9158%
[2025-08-20 22:31:57,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:31:57,921][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:31:59,389][__main__][INFO] - Iteration 926 took 54s (37.25% Gen, 62.74% Train). Generation: 20s, Training: 34s. Estimated remaining time: 53m 32s. Estimated total time: 15h 15m 1s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 30s.
[2025-08-20 22:31:59,390][__main__][INFO] - Starting iteration 926.
[2025-08-20 22:32:22,407][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:32:22,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:32:22,415][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.9157%
[2025-08-20 22:32:24,885–22:32:24,893][mllm.training.trainer_ad_align][INFO] - (the same three task records repeat with identical counters)
[2025-08-20 22:32:24,895][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:32:24,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:32:25,193–22:32:50,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (record logged 32×, ~0.8 s apart)
[2025-08-20 22:32:51,954][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.29e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.9158%
[2025-08-20 22:32:52,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:32:52,902][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:32:54,197][__main__][INFO] - Iteration 927 took 54s (37.51% Gen, 62.49% Train). Generation: 20s, Training: 34s. Estimated remaining time: 51m 3s. Estimated total time: 15h 13m 26s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 43s.
[2025-08-20 22:32:54,199][__main__][INFO] - Starting iteration 927.
[2025-08-20 22:33:17,773][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:33:17,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:33:17,780][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:33:20,240][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:33:20,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:33:20,247][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:33:20,250][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:33:20,250][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:33:20,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:21,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:22,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:22,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:23,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:24,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:25,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:26,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:26,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:27,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:28,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:29,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:30,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:30,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:31,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:32,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:33,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:34,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:34,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:35,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:36,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:37,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:38,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:38,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:39,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:40,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:41,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:42,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:43,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:44,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:44,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:45,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:33:47,287][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:33:48,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:33:48,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:33:49,601][__main__][INFO] - Iteration 928 took 55s (38.13% Gen, 61.87% Train). Generation: 21s, Training: 34s. Estimated remaining time: 1h 0m 3s. Estimated total time: 15h 23m 22s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 20s, 500 more iterations: 7h 41m 41s.
[2025-08-20 22:33:49,603][__main__][INFO] - Starting iteration 928.
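[editor's note] Each iteration emits 32 "Accumulated the policy gradient loss for 8960 tokens." messages followed by a single "Apply reinforce step", which is the standard gradient-accumulation pattern: one backward pass per micro-batch, one optimizer step at the end. A hedged sketch of that loop; the tensor names, the `policy.log_probs` helper, and the per-micro-batch normalization are assumptions, since the log shows only the messages.

def accumulate_policy_gradient(policy, optimizer, micro_batches, log):
    """REINFORCE-style accumulation: backward per micro-batch, one step at the end."""
    optimizer.zero_grad(set_to_none=True)
    for mb in micro_batches:
        logp = policy.log_probs(mb["input_ids"])  # assumed helper: (batch, seq) log-probs
        # mask marks the action tokens that actually contribute to the gradient
        loss = -(mb["advantages"] * logp * mb["mask"]).sum() / mb["mask"].sum()
        loss.backward()  # gradients accumulate across micro-batches
        log.info(f"Accumulated the policy gradient loss for {int(mb['mask'].sum())} tokens.")
    optimizer.step()  # corresponds to the logged "Apply reinforce step"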
[2025-08-20 22:34:12,318][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:34:12,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:34:12,326][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:34:14,776][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:34:14,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:34:14,784][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:34:14,786][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:34:14,787][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:34:15,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:15,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:16,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:17,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:18,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:19,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:19,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:20,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:21,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:22,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:23,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:23,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:24,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:25,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:26,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:27,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:27,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:28,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:29,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:30,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:31,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:32,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:33,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:33,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:34,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:35,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:36,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:37,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:37,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:38,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:39,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:40,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:34:41,881][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:34:42,765][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:34:42,766][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:34:44,131][__main__][INFO] - Iteration 929 took 54s (37.16% Gen, 62.84% Train). Generation: 20s, Training: 34s. Estimated remaining time: 44m 34s. Estimated total time: 15h 8m 47s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 52s, 500 more iterations: 7h 34m 23s.
[2025-08-20 22:34:44,132][__main__][INFO] - Starting iteration 929.
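[editor's note] The per-iteration throughput can be read straight off the log: 32 accumulation messages at 8,960 tokens each means every reinforce step is computed over 286,720 action tokens, and at roughly 0.79 s per micro-batch the backward passes account for nearly all of the ~27 s training ΔTime. A quick arithmetic check (micro-batch timing is eyeballed from consecutive timestamps):

micro_batches = 32           # "Accumulated ..." messages per iteration in this log
tokens_per_micro_batch = 8960
print(micro_batches * tokens_per_micro_batch)  # 286720 tokens per reinforce step
print(micro_batches * 0.79)                    # ≈ 25.3 s of the ~27 s "Apply reinforce step" ΔTime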
[2025-08-20 22:35:07,037][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:35:07,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:35:07,044][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:35:09,496][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:35:09,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:35:09,504][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:35:09,506][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:35:09,507][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:35:09,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:10,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:11,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:12,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:12,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:13,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:14,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:15,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:16,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:16,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:17,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:18,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:19,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:20,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:20,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:21,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:22,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:23,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:24,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:25,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:26,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:27,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:27,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:28,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:29,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:30,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:31,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:31,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:32,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:33,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:34,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:34,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:35:36,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:35:37,619][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:35:37,621][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:35:38,831][__main__][INFO] - Iteration 930 took 54s (37.36% Gen, 62.64% Train). Generation: 20s, Training: 34s. Estimated remaining time: 46m 30s. Estimated total time: 15h 11m 37s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 9s, 500 more iterations: 7h 35m 48s.
[2025-08-20 22:35:38,832][__main__][INFO] - Starting iteration 930.
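[editor's note] After every step the trainer persists both optimizer states to the run directory, which is what lets a preempted job resume without losing Adam's moment estimates. A minimal sketch of that save path using standard PyTorch serialization; whether the real trainer saves anything beyond state_dict() is not visible in the log.

import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, log):
    """Persist optimizer state dicts; mirrors the two 'Saved ... optimizer state' lines."""
    for label, name, opt in [("main", "policy", policy_opt), ("critic", "critic", critic_opt)]:
        path = f"{out_dir}/{name}_optimizer_state.pt"
        torch.save(opt.state_dict(), path)
        log.info(f"Saved {label} optimizer state to {path}")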
[2025-08-20 22:36:01,984][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:01,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:01,991][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:04,433][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:04,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:04,441][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:04,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:36:04,443][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:36:04,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:05,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:06,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:07,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:07,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:08,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:09,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:10,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:11,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:11,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:12,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:13,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:14,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:15,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:15,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:16,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:17,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:18,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:19,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:19,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:21,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:21,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:22,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:23,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:24,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:25,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:25,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:26,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:27,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:28,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:29,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:29,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:36:31,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:36:32,313][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:36:32,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:36:33,671][__main__][INFO] - Iteration 931 took 54s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 47m 56s. Estimated total time: 15h 13m 58s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 23s, 500 more iterations: 7h 36m 59s.
[2025-08-20 22:36:33,673][__main__][INFO] - Starting iteration 931.
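[editor's note] The iteration summary lines are simple extrapolation: the Gen/Train split is each phase's share of the iteration, and the "N more iterations" figures scale the recent per-iteration time by N. A sketch reproducing that arithmetic with hypothetical phase times; whatever smoothing sits behind "Estimated remaining time" is not recoverable from the log.

gen_s, train_s = 20.4, 33.6  # hypothetical measured phase times, ≈ 54 s per iteration
iter_s = gen_s + train_s
print(f"{100 * gen_s / iter_s:.2f}% Gen, {100 * train_s / iter_s:.2f}% Train")
for n in (10, 100, 500):     # the "Time estimates for N more iterations" figures
    m, s = divmod(round(n * iter_s), 60)
    h, m = divmod(m, 60)
    print(f"{n} more iterations: {h}h {m}m {s}s")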
[2025-08-20 22:36:56,590][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:56,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:56,598][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:59,024][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:59,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:59,032][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:36:59,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:36:59,035][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:36:59,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:00,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:00,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:01,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:02,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:03,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:04,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:04,885][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:05,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:06,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:07,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:08,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:08,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:09,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:10,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:11,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:12,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:13,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:14,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:14,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:15,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:16,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:17,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:18,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:18,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:19,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:20,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:21,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:22,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:22,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:23,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:24,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:26,077][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:37:27,020][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:37:27,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:37:28,479][__main__][INFO] - Iteration 932 took 54s (37.36% Gen, 62.63% Train). Generation: 20s, Training: 34s. Estimated remaining time: 46m 27s. Estimated total time: 15h 13m 25s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 42s.
[2025-08-20 22:37:28,480][__main__][INFO] - Starting iteration 932.
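[editor's note] The "Sharing advantage alignment data." / "Receiving advantage packets." pair indicates the two negotiating trainers exchange per-step advantage estimates before forming the policy-gradient loss, in the spirit of advantage-alignment opponent shaping, where each agent weights its log-probabilities by a combination of its own and its opponent's advantages. The packet contents, transport, and shaping formula below are assumptions for illustration only; none of them are shown in the log.

def exchange_advantages(own_adv, send_queue, recv_queue):
    """Hypothetical exchange: ship our advantages, block on the peer's."""
    send_queue.put(own_adv.detach().cpu())  # "Sharing advantage alignment data."
    opp_adv = recv_queue.get()              # "Receiving advantage packets."
    return opp_adv

def aligned_weights(own_adv, opp_adv, beta=1.0):
    # One plausible advantage-alignment surrogate weight: the agent's own advantage
    # plus a shaping term that rewards actions whose advantage co-varies with the
    # opponent's. The actual weighting used by trainer_ad_align is an assumption here.
    return own_adv + beta * own_adv * opp_adv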
[2025-08-20 22:37:51,172][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:37:51,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:37:51,180][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:37:53,632][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:37:53,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:37:53,640][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:37:53,642][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:37:53,643][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:37:53,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:54,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:55,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:56,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:57,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:57,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:58,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:37:59,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:00,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:01,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:01,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:02,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:03,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:04,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:05,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:05,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:06,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:07,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:08,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:09,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:10,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:11,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:11,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:12,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:13,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:14,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:15,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:15,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:16,704][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:17,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:18,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:19,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:20,685][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:38:21,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:38:21,609][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:38:22,874][__main__][INFO] - Iteration 933 took 54s (37.22% Gen, 62.78% Train). Generation: 20s, Training: 34s. Estimated remaining time: 38m 41s. Estimated total time: 15h 6m 33s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 39s, 500 more iterations: 7h 33m 16s.
[2025-08-20 22:38:22,876][__main__][INFO] - Starting iteration 933.
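[editor's note] The "Create alternative trajectory batch" / "Compute alternative advantage estimates" tasks run as a triple, twice per iteration, and always report zero ΔVRAM and 00:00:00 ΔTime, consistent with reshuffling cached tensors rather than fresh forward passes. What "alternative" means exactly is not recoverable from the log; the sketch below only shows the generic shape such a step could take: rebuild a batch, score it with the critic, and turn rewards plus values into advantages. All names and the return computation are assumptions.

import torch

@torch.no_grad()
def alternative_advantages(critic, batch, rewards, gamma=0.99):
    """Sketch: value-baseline advantages on a rebuilt ('alternative') batch."""
    values = critic(batch["input_ids"]).squeeze(-1)  # assumed (batch, seq) value estimates
    returns = torch.zeros_like(rewards)
    running = torch.zeros(rewards.shape[0])
    for t in reversed(range(rewards.shape[1])):      # discounted return-to-go
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns - values                          # advantage = return - baseline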
[2025-08-20 22:38:46,368][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:38:46,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:38:46,376][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:38:48,815][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:38:48,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:38:48,823][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:38:48,825][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:38:48,826][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:38:49,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:49,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:50,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:51,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:52,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:53,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:53,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:54,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:55,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:56,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:57,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:57,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:58,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:38:59,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:00,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:01,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:01,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:02,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:03,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:04,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:05,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:05,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:07,036][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:07,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:08,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:09,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:10,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:11,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:11,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:12,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:13,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:14,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:15,764][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:39:16,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:39:16,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:39:18,030][__main__][INFO] - Iteration 934 took 55s (38.19% Gen, 61.81% Train). Generation: 21s, Training: 34s. Estimated remaining time: 50m 27s. Estimated total time: 15h 19m 14s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 37s.
[2025-08-20 22:39:18,032][__main__][INFO] - Starting iteration 934.
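[editor's note] "Get advantages with critic gradient accumulation" suggests the critic's value-regression gradients are accumulated in the same pass that produces the advantages, so the critic can be stepped alongside the policy without a separate epoch over the data. A hedged sketch; the MSE loss form and all names are assumptions.

import torch
import torch.nn.functional as F

def advantages_with_critic_grads(critic, batch, returns):
    """Compute advantages while accumulating the critic's regression gradients.
    Assumes `returns` are precomputed discounted returns of shape (batch, seq)."""
    values = critic(batch["input_ids"]).squeeze(-1)
    critic_loss = F.mse_loss(values, returns)
    critic_loss.backward()              # accumulate into critic grads; the step happens later
    return (returns - values).detach()  # advantages carry no critic gradient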
[2025-08-20 22:39:41,242][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:39:41,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:39:41,250][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:39:43,719][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:39:43,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:39:43,727][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:39:43,730][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:39:43,731][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:39:44,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:44,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:45,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:46,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:47,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:47,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:48,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:49,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:50,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:51,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:51,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:52,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:53,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:54,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:55,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:55,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:56,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:57,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:58,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:59,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:39:59,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:01,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:02,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:02,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:03,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:04,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:05,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:06,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:06,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:07,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:08,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:09,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:10,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:40:11,722][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:40:11,724][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:40:13,033][__main__][INFO] - Iteration 935 took 55s (37.75% Gen, 62.24% Train). Generation: 20s, Training: 34s. Estimated remaining time: 46m 59s. Estimated total time: 15h 16m 41s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 20s.
[2025-08-20 22:40:13,034][__main__][INFO] - Starting iteration 935.
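[editor's note] The saved *_optimizer_state.pt files pair with a resume path: on restart, a trainer can reload them so Adam's moments survive preemption. A minimal sketch of the reload side, assuming the files hold plain state_dict()s as in the save sketch above; the real trainer's resume logic is not shown in the log.

import torch

def load_optimizer_states(policy_opt, critic_opt, out_dir):
    # Assumes each .pt file is a bare optimizer state_dict, matching the save sketch.
    policy_opt.load_state_dict(torch.load(f"{out_dir}/policy_optimizer_state.pt"))
    critic_opt.load_state_dict(torch.load(f"{out_dir}/critic_optimizer_state.pt"))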
[2025-08-20 22:40:35,889][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:40:35,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:40:35,897][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:40:38,392][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:40:38,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:40:38,405][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:40:38,408][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:40:38,408][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:40:38,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:39,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:40,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:41,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:41,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:42,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:43,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:44,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:45,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:45,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:46,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:47,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:48,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:49,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:49,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:50,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:51,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:52,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:53,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:54,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:55,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:55,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:56,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:57,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:58,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:59,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:40:59,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:00,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:01,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:02,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:03,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:03,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:05,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:41:06,548][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:41:06,550][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:41:07,925][__main__][INFO] - Iteration 936 took 54s (37.19% Gen, 62.81% Train). Generation: 20s, Training: 34s. Estimated remaining time: 44m 13s. Estimated total time: 15h 14m 50s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 25s.
[2025-08-20 22:41:07,927][__main__][INFO] - Starting iteration 936.
[2025-08-20 22:41:30,957][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:41:30,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:41:30,965][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:41:33,449][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:41:33,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:41:33,456][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:41:33,459][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:41:33,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:41:33,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:34,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:35,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:36,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:36,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:37,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:38,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:39,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:40,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:40,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:41,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:42,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:43,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:44,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:44,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:45,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:46,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:47,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:48,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:48,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:49,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:50,440][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:51,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:52,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:53,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:54,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:54,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:55,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:56,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:57,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:58,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:41:58,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:00,391][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:42:01,364][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:42:01,366][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:42:02,673][__main__][INFO] - Iteration 937 took 54s (37.55% Gen, 62.45% Train). Generation: 20s, Training: 34s. Estimated remaining time: 40m 54s. Estimated total time: 15h 12m 25s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 14s, 500 more iterations: 7h 36m 12s.
[2025-08-20 22:42:02,674][__main__][INFO] - Starting iteration 937.
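Each iteration accumulates the policy-gradient loss over 32 fixed-size microbatches of 8960 action tokens before taking a single optimizer step; that is what the repeated "Accumulated the policy gradient loss for 8960 tokens." lines record. A minimal sketch of such an accumulation loop, assuming each chunk carries detached advantages and per-token log-probabilities; all names are illustrative, not the repository's actual API:

```python
# Illustrative accumulation loop; `chunks` stands in for whatever structure
# the trainer iterates over, one microbatch of action tokens at a time.
def accumulate_policy_gradient(chunks, logger):
    for chunk in chunks:  # 32 microbatches of 8960 tokens in this run
        logp = chunk.logprobs        # log pi(a_t | s_t) for action tokens
        adv = chunk.advantages       # detached advantage estimates
        n_tok = logp.numel()
        loss = -(adv * logp).sum() / n_tok   # REINFORCE-style surrogate
        (loss / len(chunks)).backward()      # average gradients over chunks
        logger.info(f"Accumulated the policy gradient loss for {n_tok} tokens.")
```

Splitting the batch this way keeps peak activation memory bounded by one microbatch while still producing the gradient of the full-batch loss.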
[2025-08-20 22:42:25,745][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:42:25,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:42:25,752][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:42:28,216][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:42:28,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:42:28,224][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:42:28,226][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:42:28,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:42:28,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:29,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:30,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:30,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:31,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:32,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:33,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:34,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:34,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:35,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:36,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:37,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:38,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:38,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:39,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:40,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:41,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:42,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:42,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:43,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:44,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:45,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:46,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:47,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:48,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:48,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:49,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:50,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:51,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:52,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:52,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:53,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:42:55,248][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:42:56,190][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:42:56,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:42:57,597][__main__][INFO] - Iteration 938 took 54s (37.52% Gen, 62.47% Train). Generation: 20s, Training: 34s. Estimated remaining time: 42m 55s. Estimated total time: 15h 15m 22s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 41s.
[2025-08-20 22:42:57,598][__main__][INFO] - Starting iteration 938.
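The "Apply reinforce step" task consumes the accumulated gradients in one optimizer update. Its ΔVRAM Allocated of ~2.3e-05 GB (a few kilobytes) is consistent with the optimizer state already being allocated, so the step only creates tiny transients. A sketch of such a step under that reading; gradient clipping and its threshold are assumptions, not values taken from this log:

```python
import torch


def apply_reinforce_step(policy, optimizer, max_grad_norm=1.0):
    # Gradients were already accumulated over all microbatches above;
    # one optimizer step consumes them, then the buffers are cleared.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```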
[2025-08-20 22:43:20,120][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:43:20,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:43:20,128][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:43:22,584][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:43:22,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:43:22,591][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:43:22,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:43:22,594][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:43:22,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:23,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:24,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:25,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:26,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:26,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:27,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:28,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:29,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:30,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:30,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:31,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:32,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:33,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:33,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:34,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:35,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:36,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:37,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:38,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:39,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:40,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:40,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:41,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:42,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:43,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:43,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:44,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:45,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:46,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:47,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:47,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:43:49,566][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:43:50,486][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:43:50,488][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:43:51,835][__main__][INFO] - Iteration 939 took 54s (36.98% Gen, 63.01% Train). Generation: 20s, Training: 34s. Estimated remaining time: 30m 36s. Estimated total time: 15h 3m 56s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 23s, 500 more iterations: 7h 31m 58s.
[2025-08-20 22:43:51,837][__main__][INFO] - Starting iteration 939.
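Both optimizer states are re-saved to the same files every iteration, so a preempted run can resume from the most recent step. A sketch of that checkpointing, assuming plain torch.save of the optimizers' state_dicts; note that the log labels the policy optimizer "main" while writing it to policy_optimizer_state.pt:

```python
import torch


def save_optimizer_states(policy_opt, critic_opt, out_dir, logger):
    # Overwrite the same files each iteration; only the latest state is kept.
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    torch.save(policy_opt.state_dict(), policy_path)
    logger.info(f"Saved main optimizer state to {policy_path}")
    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(critic_opt.state_dict(), critic_path)
    logger.info(f"Saved critic optimizer state to {critic_path}")
```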
[2025-08-20 22:44:14,506][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:44:14,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:44:14,513][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:44:16,978][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:44:16,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:44:16,986][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:44:16,988][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:44:16,988][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:44:17,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:18,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:18,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:19,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:20,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:21,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:22,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:22,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:23,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:24,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:25,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:26,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:26,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:27,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:28,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:29,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:29,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:30,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:31,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:32,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:33,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:33,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:35,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:35,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:36,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:37,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:38,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:39,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:39,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:40,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:41,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:42,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:44:43,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:44:44,878][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:44:44,880][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:44:46,240][__main__][INFO] - Iteration 940 took 54s (37.15% Gen, 62.84% Train). Generation: 20s, Training: 34s. Estimated remaining time: 32m 28s. Estimated total time: 15h 6m 43s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 40s, 500 more iterations: 7h 33m 21s.
[2025-08-20 22:44:46,242][__main__][INFO] - Starting iteration 940.
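The per-iteration summary projects remaining and total time from iteration durations; the estimates drift by a few minutes between iterations, which is consistent with a short or unsmoothed averaging window. A rough sketch of such a summary, with all names and the exact projection rule assumed rather than taken from the code:

```python
# Illustrative timing summary; the real code may smooth over several
# iterations, which this sketch does not.
def log_iteration_timing(logger, it, gen_s, train_s, total_iters):
    dur = gen_s + train_s
    logger.info(
        f"Iteration {it} took {dur:.0f}s "
        f"({100 * gen_s / dur:.2f}% Gen, {100 * train_s / dur:.2f}% Train). "
        f"Generation: {gen_s:.0f}s, Training: {train_s:.0f}s. "
        f"Estimated remaining time: {(total_iters - it) * dur / 60:.0f}m. "
        + ", ".join(
            f"{k} more iterations: {k * dur / 60:.0f}m" for k in (10, 100, 500)
        )
    )
```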
[2025-08-20 22:45:08,907][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:45:08,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:45:08,915][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:45:11,375][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:45:11,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:45:11,383][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:45:11,385][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:45:11,386][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:45:11,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:12,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:13,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:14,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:14,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:15,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:16,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:17,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:18,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:18,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:19,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:20,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:21,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:22,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:22,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:24,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:24,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:25,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:26,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:27,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:28,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:28,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:29,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:30,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:31,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:32,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:32,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:33,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:34,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:35,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:36,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:36,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:45:38,452][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:45:39,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:45:39,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:45:40,628][__main__][INFO] - Iteration 941 took 54s (37.19% Gen, 62.81% Train). Generation: 20s, Training: 34s. Estimated remaining time: 31m 15s. Estimated total time: 15h 6m 25s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 38s, 500 more iterations: 7h 33m 12s.
[2025-08-20 22:45:40,629][__main__][INFO] - Starting iteration 941.
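The "Sharing advantage alignment data." / "Receiving advantage packets." pair suggests the two trainers exchange per-step advantage estimates so each side can build its opponent-shaping term. A rough sketch of that exchange under that assumption; the packet layout and transport are not shown in this log and are purely illustrative:

```python
from dataclasses import dataclass

import torch


@dataclass
class AdvantagePacket:
    trajectory_id: int
    advantages: torch.Tensor  # per-step advantage estimates, detached


def exchange_advantages(own_packets, send_queue, recv_queue, logger):
    # Assumed transport: a pair of queue-like channels between the trainers.
    logger.info("Sharing advantage alignment data.")
    for pkt in own_packets:
        send_queue.put(pkt)
    logger.info("Receiving advantage packets.")
    return [recv_queue.get() for _ in range(len(own_packets))]
```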
[2025-08-20 22:46:03,483][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:46:03,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:46:03,491][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:46:05,944][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:46:05,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:46:05,951][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:46:05,954][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:46:05,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:46:06,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:07,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:07,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:08,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:09,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:10,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:10,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:11,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:12,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:13,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:14,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:14,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:15,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:16,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:17,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:18,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:18,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:19,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:20,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:21,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:22,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:23,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:24,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:24,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:25,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:26,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:27,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:28,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:28,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:29,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:30,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:31,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:46:32,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:46:33,939][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:46:33,940][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:46:35,282][__main__][INFO] - Iteration 942 took 54s (37.35% Gen, 62.65% Train). Generation: 20s, Training: 34s. Estimated remaining time: 34m 48s. Estimated total time: 15h 10m 52s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 5s, 500 more iterations: 7h 35m 26s.
[2025-08-20 22:46:35,283][__main__][INFO] - Starting iteration 942.
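"Get advantages with critic gradient accumulation" plausibly evaluates the critic on the trajectory batch, backpropagates a value-regression loss (accumulating critic gradients for a later critic update), and returns detached advantages for the policy loss. A sketch under those assumptions, using a TD(0) estimator as a stand-in for whatever estimator the code actually uses:

```python
import torch
import torch.nn.functional as F


def advantages_with_critic_accumulation(critic, batch, gamma=0.99):
    values = critic(batch.states).squeeze(-1)                 # V(s_t)
    with torch.no_grad():
        next_values = critic(batch.next_states).squeeze(-1)   # V(s_{t+1})
        targets = batch.rewards + gamma * (1 - batch.dones) * next_values
    F.mse_loss(values, targets).backward()   # accumulate critic gradients
    return (targets - values).detach()       # TD(0) advantages for the policy
```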
[2025-08-20 22:46:57,958][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:46:57,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:46:57,965][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:00,408][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:00,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:00,416][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:00,418][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:47:00,419][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:47:00,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:01,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:02,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:03,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:03,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:04,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:05,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:06,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:07,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:07,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:08,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:09,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:10,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:11,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:11,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:12,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:13,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:14,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:14,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:15,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:16,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:17,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:18,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:19,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:20,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:20,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:21,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:22,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:23,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:24,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:24,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:25,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:27,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:47:28,225][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:47:28,226][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:47:29,631][__main__][INFO] - Iteration 943 took 54s (37.22% Gen, 62.78% Train). Generation: 20s, Training: 34s. Estimated remaining time: 28m 48s. Estimated total time: 15h 5m 46s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 34s, 500 more iterations: 7h 32m 53s.
[2025-08-20 22:47:29,632][__main__][INFO] - Starting iteration 943.
[2025-08-20 22:47:52,754][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:52,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:52,762][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:55,213][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:55,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:55,220][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:47:55,222][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:47:55,223][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:47:55,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:56,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:57,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:57,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:58,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:47:59,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:00,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:01,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:01,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:02,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:03,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:04,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:05,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:05,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:06,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:07,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:08,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:09,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:09,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:10,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:11,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:12,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:13,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:14,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:15,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:16,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:16,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:17,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:18,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:19,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:19,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:20,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:22,387][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:48:23,338][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:48:23,340][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:48:24,560][__main__][INFO] - Iteration 944 took 54s (37.62% Gen, 62.37% Train). Generation: 20s, Training: 34s. Estimated remaining time: 37m 29s. Estimated total time: 15h 15m 22s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 41s.
[2025-08-20 22:48:24,561][__main__][INFO] - Starting iteration 944.
[2025-08-20 22:48:47,571][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:48:47,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:48:47,579][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:48:50,008][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:48:50,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:48:50,016][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:48:50,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:48:50,019][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:48:50,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:51,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:51,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:52,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:53,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:54,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:55,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:55,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:56,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:57,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:58,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:59,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:48:59,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:00,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:01,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:02,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:03,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:03,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:04,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:05,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:06,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:07,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:08,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:09,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:09,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:10,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:11,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:12,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:13,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:13,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:14,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:15,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:17,088][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:49:18,012][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:49:18,014][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:49:19,519][__main__][INFO] - Iteration 945 took 54s (37.44% Gen, 62.56% Train). Generation: 20s, Training: 34s. Estimated remaining time: 37m 8s. Estimated total time: 15h 15m 57s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 58s.
[2025-08-20 22:49:19,520][__main__][INFO] - Starting iteration 945.
[2025-08-20 22:49:42,273][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:49:42,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:49:42,281][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:49:44,748][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:49:44,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:49:44,756][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:49:44,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:49:44,759][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:49:45,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:45,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:46,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:47,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:48,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:49,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:49,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:50,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:51,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:52,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:53,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:53,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:54,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:55,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:56,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:57,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:58,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:59,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:49:59,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:00,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:01,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:02,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:03,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:03,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:04,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:05,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:06,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:07,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:07,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:08,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:09,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:10,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:11,760][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:50:12,712][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:50:12,713][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:50:14,246][__main__][INFO] - Iteration 946 took 54s (37.08% Gen, 62.91% Train). Generation: 20s, Training: 34s. Estimated remaining time: 32m 22s. Estimated total time: 15h 12m 5s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 12s, 500 more iterations: 7h 36m 2s.
[2025-08-20 22:50:14,248][__main__][INFO] - Starting iteration 946.
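Within each iteration, 32 "Accumulated the policy gradient loss for 8960 tokens." lines precede a single "Apply reinforce step", which is the classic gradient-accumulation pattern: back-propagate micro-batch losses one at a time, then take one optimizer step. A sketch under stated assumptions follows; only the accumulate-then-step structure and the per-micro-batch token count come from the log, while the loss form, tensor shapes, and gradient clipping are hypothetical.

```python
import torch

def apply_reinforce_step(policy, optimizer, micro_batches, logger):
    """Accumulate a REINFORCE-style loss over micro-batches, then step once."""
    optimizer.zero_grad(set_to_none=True)
    for logprobs, advantages, mask in micro_batches:
        # Surrogate policy-gradient loss, averaged over the valid tokens of
        # this micro-batch (8960 tokens per micro-batch in the log).
        loss = -(advantages * logprobs * mask).sum() / mask.sum()
        loss.backward()  # gradients add up across micro-batches
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    int(mask.sum().item()))
    # Gradient clipping is an assumption; the log does not show it.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    optimizer.step()
```

The near-zero ΔVRAM on "Apply reinforce step" after the first iteration is consistent with this layout: gradient and optimizer buffers are allocated once and reused thereafter.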
[2025-08-20 22:50:37,321][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:50:37,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:50:37,328][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:50:39,800][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:50:39,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:50:39,807][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:50:39,810][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:50:39,810][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:50:40,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:40,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:41,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:42,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:43,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:44,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:44,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:45,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:46,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:47,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:48,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:48,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:49,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:50,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:51,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:52,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:52,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:53,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:54,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:55,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:55,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:57,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:58,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:58,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:50:59,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:00,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:01,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:02,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:02,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:03,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:04,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:05,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:06,891][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:51:07,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:51:07,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:51:09,050][__main__][INFO] - Iteration 947 took 54s (37.63% Gen, 62.37% Train). Generation: 20s, Training: 34s. Estimated remaining time: 32m 44s. Estimated total time: 15h 13m 22s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 41s.
[2025-08-20 22:51:09,052][__main__][INFO] - Starting iteration 947.
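Every iteration ends by overwriting the two optimizer-state files under the run's agent_trainer directory, which is what lets a preempted job resume mid-run. A plausible implementation is a pair of `torch.save` calls on the optimizers' `state_dict()`; the helper below is a sketch (the function name and the overwrite-in-place policy are assumptions, while the file names come from the log).

```python
import os
import torch

def save_optimizer_states(policy_optimizer, critic_optimizer, trainer_dir, logger):
    """Persist both optimizer states so the run can resume after preemption."""
    policy_path = os.path.join(trainer_dir, "policy_optimizer_state.pt")
    torch.save(policy_optimizer.state_dict(), policy_path)
    logger.info("Saved main optimizer state to %s", policy_path)

    critic_path = os.path.join(trainer_dir, "critic_optimizer_state.pt")
    torch.save(critic_optimizer.state_dict(), critic_path)
    logger.info("Saved critic optimizer state to %s", critic_path)
```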
[2025-08-20 22:51:31,644][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:51:31,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:51:31,652][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:51:34,107][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:51:34,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:51:34,115][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:51:34,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:51:34,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:51:34,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:35,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:35,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:36,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:37,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:38,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:39,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:39,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:40,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:41,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:42,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:43,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:43,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:44,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:45,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:46,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:47,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:47,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:48,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:49,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:50,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:51,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:51,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:53,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:54,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:54,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:55,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:56,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:57,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:57,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:58,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:51:59,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:01,181][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:52:02,104][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:52:02,105][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:52:03,441][__main__][INFO] - Iteration 948 took 54s (37.02% Gen, 62.98% Train). Generation: 20s, Training: 34s. Estimated remaining time: 24m 55s. Estimated total time: 15h 6m 28s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 38s, 500 more iterations: 7h 33m 14s.
[2025-08-20 22:52:03,442][__main__][INFO] - Starting iteration 948.
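The per-iteration summaries project runtimes by scaling a per-iteration average: around iteration 946, "10 more iterations: 9m 7s" corresponds to roughly 54.7 s per iteration, and the 100- and 500-iteration figures are the same average multiplied out. The sketch below reproduces that arithmetic and the log's duration format; the averaging window and the exact rule behind "Estimated remaining time" are assumptions, since the log does not reveal them.

```python
def format_duration(seconds: float) -> str:
    """Render a duration the way the log does, e.g. '1h 31m 12s' or '54s'."""
    s = int(round(seconds))
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if m or h:
        parts.append(f"{m}m")
    parts.append(f"{s}s")
    return " ".join(parts)

def time_estimates(iteration_seconds: list[float]) -> str:
    """Project runtimes for 10/100/500 more iterations from the mean so far."""
    avg = sum(iteration_seconds) / len(iteration_seconds)
    return ", ".join(
        f"{n} more iterations: {format_duration(avg * n)}" for n in (10, 100, 500)
    )

# Example: an average of 54.7 s/iteration yields '10 more iterations: 9m 7s, ...'
print(time_estimates([54.7]))
```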
[2025-08-20 22:52:26,073][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:52:26,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:52:26,081][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:52:28,542][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:52:28,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:52:28,550][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:52:28,552][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:52:28,552][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:52:28,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:29,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:30,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:31,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:32,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:32,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:33,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:34,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:35,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:35,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:36,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:37,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:38,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:39,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:39,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:40,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:41,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:42,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:43,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:43,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:45,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:45,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:46,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:47,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:48,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:49,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:49,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:50,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:51,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:52,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:53,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:53,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:52:55,517][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:52:56,431][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:52:56,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:52:57,733][__main__][INFO] - Iteration 949 took 54s (37.17% Gen, 62.83% Train). Generation: 20s, Training: 34s. Estimated remaining time: 22m 24s. Estimated total time: 15h 4m 50s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 29s, 500 more iterations: 7h 32m 25s.
[2025-08-20 22:52:57,735][__main__][INFO] - Starting iteration 949.
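The "Sharing advantage alignment data." / "Receiving advantage packets." pair marks the point where the two trainers exchange advantage estimates, which advantage alignment requires because each agent's policy-gradient weighting combines its own advantages with its counterpart's. How mllm actually moves these packets is not visible in the log; below is one hedged way to do the swap with PyTorch's point-to-point collectives, purely as an illustration.

```python
import torch
import torch.distributed as dist

def exchange_advantages(own_advantages: torch.Tensor, peer_rank: int, logger):
    """Swap same-shaped advantage tensors with the counterpart trainer.

    Assumes both trainers already joined one process group; the real packet
    format and transport in mllm are not shown in the log.
    """
    logger.info("Sharing advantage alignment data.")
    received = torch.empty_like(own_advantages)
    ops = [
        dist.P2POp(dist.isend, own_advantages, peer_rank),
        dist.P2POp(dist.irecv, received, peer_rank),
    ]
    logger.info("Receiving advantage packets.")
    for work in dist.batch_isend_irecv(ops):
        work.wait()  # block until both transfers complete
    return received
```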
[2025-08-20 22:53:20,371][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:53:20,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:53:20,379][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:53:22,860][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:53:22,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:53:22,868][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:53:22,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:53:22,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:53:23,169][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:23,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:24,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:25,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:26,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:27,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:27,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:28,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:29,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:30,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:31,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:31,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:32,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:33,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:34,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:35,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:35,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:36,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:37,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:38,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:39,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:40,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:41,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:41,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:42,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:43,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:44,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:45,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:45,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:46,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:47,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:48,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:53:49,935][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:53:50,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:53:50,841][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:53:52,125][__main__][INFO] - Iteration 950 took 54s (37.08% Gen, 62.92% Train). Generation: 20s, Training: 34s. Estimated remaining time: 23m 9s. Estimated total time: 15h 6m 30s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 39s, 500 more iterations: 7h 33m 15s.
[2025-08-20 22:53:52,127][__main__][INFO] - Starting iteration 950.
[2025-08-20 22:54:14,786][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:54:14,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:54:14,794][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:54:17,247][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:54:17,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:54:17,255][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:54:17,257][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:54:17,257][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:54:17,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:18,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:19,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:19,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:20,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:21,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:22,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:23,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:23,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:24,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:25,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:26,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:27,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:27,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:28,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:29,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:30,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:31,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:31,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:32,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:33,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:34,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:35,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:36,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:37,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:37,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:38,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:39,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:40,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:41,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:41,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:42,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:54:44,211][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:54:45,195][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:54:45,197][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:54:48,901][__main__][INFO] - Iteration 951 took 56s (35.61% Gen, 60.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 1h 1m 56s. Estimated total time: 15h 46m 13s. Time estimates for 10 more iterations: 9m 27s, 100 more iterations: 1h 34m 37s, 500 more iterations: 7h 53m 6s.
[2025-08-20 22:54:48,903][__main__][INFO] - Starting iteration 951.
[2025-08-20 22:55:11,809][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:55:11,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:55:11,817][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:55:14,285][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:55:14,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:55:14,293][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:55:14,295][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:55:14,296][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:55:14,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:15,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:16,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:16,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:17,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:18,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:19,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:20,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:20,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:21,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:22,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:23,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:24,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:24,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:25,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:27,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:27,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:28,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:29,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:30,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:30,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:31,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:32,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:33,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:34,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:34,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:35,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:36,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:37,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:38,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:38,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:39,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:55:41,340][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:55:42,233][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:55:42,235][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:55:43,587][__main__][INFO] - Iteration 952 took 54s (37.37% Gen, 62.62% Train). Generation: 20s, Training: 34s. Estimated remaining time: 26m 10s. Estimated total time: 15h 11m 23s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 8s, 500 more iterations: 7h 35m 41s.
[2025-08-20 22:55:43,588][__main__][INFO] - Starting iteration 952.
[2025-08-20 22:56:06,276][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:56:06,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:56:06,284][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:56:08,770][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:56:08,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:56:08,778][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:56:08,780][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:56:08,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:56:09,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:09,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:10,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:11,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:12,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:13,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:13,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:14,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:15,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:16,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:17,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:17,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:18,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:19,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:20,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:20,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:21,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:22,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:23,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:24,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:24,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:25,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:27,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:27,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:28,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:29,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:30,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:31,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:31,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:32,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:33,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:34,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:56:35,857][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:56:36,806][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:56:36,807][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:56:38,013][__main__][INFO] - Iteration 953 took 54s (37.12% Gen, 62.88% Train). Generation: 20s, Training: 34s. Estimated remaining time: 20m 57s. Estimated total time: 15h 7m 4s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 42s, 500 more iterations: 7h 33m 32s.
[2025-08-20 22:56:38,014][__main__][INFO] - Starting iteration 953.
[2025-08-20 22:57:00,626][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:00,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:00,634][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:03,110][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:03,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:03,118][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:03,120][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:57:03,121][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:57:03,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:04,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:05,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:05,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:06,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:07,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:08,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:08,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:09,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:10,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:11,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:12,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:12,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:13,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:14,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:15,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:16,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:16,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:17,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:18,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:19,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:20,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:21,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:22,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:22,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:23,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:24,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:25,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:26,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:26,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:27,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:28,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:30,084][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:57:31,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:57:31,014][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:57:32,457][__main__][INFO] - Iteration 954 took 54s (36.97% Gen, 63.03% Train). Generation: 20s, Training: 34s. Estimated remaining time: 20m 20s. Estimated total time: 15h 7m 21s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 44s, 500 more iterations: 7h 33m 40s.
[2025-08-20 22:57:32,458][__main__][INFO] - Starting iteration 954.
[2025-08-20 22:57:54,958][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:54,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:54,966][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:57,428][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:57,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:57,436][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:57:57,438][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:57:57,439][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:57:57,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:58,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:57:59,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:00,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:00,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:01,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:02,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:03,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:04,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:04,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:05,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:06,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:07,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:08,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:08,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:09,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:10,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:11,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:12,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:13,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:14,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:14,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:15,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:16,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:17,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:18,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:18,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:19,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:20,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:21,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:22,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:22,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:24,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:58:25,357][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:58:25,359][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:58:26,822][__main__][INFO] - Iteration 955 took 54s (36.90% Gen, 63.10% Train). Generation: 20s, Training: 34s. Estimated remaining time: 18m 8s. Estimated total time: 15h 6m 4s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 36s, 500 more iterations: 7h 33m 2s.
[2025-08-20 22:58:26,824][__main__][INFO] - Starting iteration 955.
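The "For task: … , ΔVRAM Allocated: … GB, ΔVRAM Reserved: … GB, ΔTime: …" records read like the output of a context manager that snapshots CUDA memory counters and wall-clock time around each named task. A sketch using the real `torch.cuda` APIs; the class name, and the assumption that the percentage is computed from reserved memory over device capacity, are mine, not the codebase's:

```python
import time
import torch

class TaskMeter:
    """Log ΔVRAM and ΔTime for a named task, mimicking the trainer log lines."""

    def __init__(self, name, device=0):
        self.name, self.device = name, device

    def __enter__(self):
        self.t0 = time.monotonic()
        self.alloc0 = torch.cuda.memory_allocated(self.device)
        self.res0 = torch.cuda.memory_reserved(self.device)
        return self

    def __exit__(self, *exc):
        gib = 1024 ** 3
        d_alloc = (torch.cuda.memory_allocated(self.device) - self.alloc0) / gib
        d_res = (torch.cuda.memory_reserved(self.device) - self.res0) / gib
        dt = time.strftime("%H:%M:%S", time.gmtime(time.monotonic() - self.t0))
        total = torch.cuda.get_device_properties(self.device).total_memory
        pct = 100 * torch.cuda.memory_reserved(self.device) / total
        print(f"For task: {self.name} , ΔVRAM Allocated: {d_alloc} GB, "
              f"ΔVRAM Reserved: {d_res} GB, ΔTime: {dt}, "
              f"Percentage of VRAM taken: {pct}%,")

# Usage: with TaskMeter("Create alternative trajectory batch"): build_batch()
```

Tracking both counters is informative: `memory_allocated` moves with live tensors, while `memory_reserved` only grows when the caching allocator requests new segments, which is why most tasks above report 0.0 GB for both once the allocator has warmed up.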
[2025-08-20 22:58:49,492][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:58:49,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:58:49,500][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:58:51,972][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:58:51,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:58:51,980][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:58:51,982][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:58:51,983][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:58:52,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:53,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:53,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:54,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:55,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:56,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:57,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:57,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:58,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:58:59,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:00,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:01,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:01,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:02,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:03,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:04,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:04,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:05,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:06,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:07,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:08,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:08,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:09,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:10,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:11,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:12,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:13,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:14,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:15,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:15,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:16,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:17,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:19,026][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 22:59:19,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 22:59:19,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 22:59:21,197][__main__][INFO] - Iteration 956 took 54s (37.16% Gen, 62.84% Train). Generation: 20s, Training: 34s. Estimated remaining time: 17m 22s. Estimated total time: 15h 6m 12s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 37s, 500 more iterations: 7h 33m 6s.
[2025-08-20 22:59:21,198][__main__][INFO] - Starting iteration 956.
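The two "Saved … optimizer state" lines per iteration persist the policy and critic optimizers separately, which is what lets an interrupted run resume with its Adam moments intact. A minimal equivalent using `torch.save`; the function name is hypothetical and the directory layout simply echoes the paths in the log:

```python
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir, log):
    """Persist both optimizer state dicts so training can resume mid-run."""
    policy_path = f"{out_dir}/policy_optimizer_state.pt"
    critic_path = f"{out_dir}/critic_optimizer_state.pt"
    torch.save(policy_opt.state_dict(), policy_path)  # step counts, exp. averages, etc.
    log.info("Saved main optimizer state to %s", policy_path)
    torch.save(critic_opt.state_dict(), critic_path)
    log.info("Saved critic optimizer state to %s", critic_path)
```

Saving after every iteration costs only about a second here (22:59:19,934 to 22:59:19,936 for the critic), a cheap insurance for a run estimated at over 15 hours.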
[2025-08-20 22:59:44,933][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:59:44,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:59:44,940][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:59:47,418][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:59:47,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:59:47,426][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 22:59:47,428][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 22:59:47,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 22:59:47,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:48,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:49,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:50,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:50,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:51,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:52,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:53,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:54,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:54,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:55,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:56,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:57,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:58,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:58,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 22:59:59,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:00,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:01,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:02,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:03,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:04,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:04,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:05,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:06,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:07,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:08,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:08,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:09,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:10,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:11,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:12,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:12,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:14,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:00:15,358][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:00:15,359][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:00:16,720][__main__][INFO] - Iteration 957 took 55s (38.31% Gen, 61.69% Train). Generation: 21s, Training: 34s. Estimated remaining time: 35m 35s. Estimated total time: 15h 25m 21s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 32s, 500 more iterations: 7h 42m 40s.
[2025-08-20 23:00:16,721][__main__][INFO] - Starting iteration 957.
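Each iteration summary splits the ~54s of wall time into its generation and training phases and extrapolates remaining and total time from per-iteration durations. The exact estimator is not visible in the log (the remaining-time figure swings between roughly 9m and 36m while per-iteration time barely moves, so it evidently reacts sharply to recent iterations); the sketch below reproduces the arithmetic of the summary line with a plain running mean, purely as an assumption:

```python
def iteration_summary(idx, gen_s, train_s, total_iters, durations):
    """Format the per-iteration timing line from measured phase durations."""
    total = gen_s + train_s
    durations.append(total)
    avg = sum(durations) / len(durations)      # assumed: simple running mean
    remaining = avg * (total_iters - idx - 1)

    def fmt(seconds):
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

    return (f"Iteration {idx} took {int(total)}s "
            f"({100 * gen_s / total:.2f}% Gen, {100 * train_s / total:.2f}% Train). "
            f"Generation: {int(gen_s)}s, Training: {int(train_s)}s. "
            f"Estimated remaining time: {fmt(remaining)}. "
            f"Estimated total time: {fmt(avg * total_iters)}.")
```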
[2025-08-20 23:00:39,443][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:00:39,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:00:39,451][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:00:41,896][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:00:41,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:00:41,904][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:00:41,907][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:00:41,907][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:00:42,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:43,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:43,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:44,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:45,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:46,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:46,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:47,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:48,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:49,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:50,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:50,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:51,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:52,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:53,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:54,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:54,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:55,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:56,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:57,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:58,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:00:59,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:00,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:01,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:01,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:02,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:03,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:04,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:04,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:05,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:06,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:07,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:08,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:01:09,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:01:09,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:01:11,163][__main__][INFO] - Iteration 958 took 54s (37.27% Gen, 62.73% Train). Generation: 20s, Training: 34s. Estimated remaining time: 16m 41s. Estimated total time: 15h 7m 21s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 44s, 500 more iterations: 7h 33m 40s.
[2025-08-20 23:01:11,165][__main__][INFO] - Starting iteration 958.
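The paired "Sharing advantage alignment data" / "Receiving advantage packets" records mark the point where the two agents' trainers swap per-trajectory advantages: advantage alignment needs each agent's own advantage estimates together with its opponent's for the same episodes before the alignment term of the policy-gradient loss can be formed. The transport mechanism is not shown in the log; the queue-based exchange below is purely illustrative:

```python
import queue

def exchange_advantages(my_advantages, send_q: queue.Queue, recv_q: queue.Queue, log):
    """Swap advantage tensors between the two agents' trainers.

    Each trainer pushes its own advantages and blocks until the opponent's
    packet arrives; in the real system the queues could equally be a
    distributed channel between processes.
    """
    log.info("Sharing advantage alignment data.")
    send_q.put(my_advantages)
    log.info("Receiving advantage packets.")
    opponent_advantages = recv_q.get()   # blocks until the peer has shared
    return opponent_advantages
```

The sub-millisecond gap between the two log lines (e.g. 23:00:41,907 for both) suggests the peer's packet is already available by the time this trainer asks for it.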
[2025-08-20 23:01:34,296][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:01:34,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:01:34,304][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:01:36,742][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:01:36,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:01:36,750][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:01:36,752][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:01:36,753][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:01:37,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:37,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:38,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:39,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:40,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:41,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:41,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:42,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:43,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:44,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:44,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:45,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:46,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:47,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:48,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:48,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:49,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:50,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:51,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:52,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:52,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:53,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:54,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:55,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:56,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:57,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:58,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:58,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:01:59,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:00,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:01,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:02,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:03,800][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:02:04,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:02:04,702][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:02:06,068][__main__][INFO] - Iteration 959 took 54s (37.71% Gen, 62.29% Train). Generation: 20s, Training: 34s. Estimated remaining time: 23m 28s. Estimated total time: 15h 15m 3s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 31s.
[2025-08-20 23:02:06,070][__main__][INFO] - Starting iteration 959.
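"Get advantages with critic gradient accumulation" suggests the critic's value-regression loss is backpropagated in the same pass that produces the advantage estimates, so the critic update piggybacks on work the trainer must do anyway. How the "alternative trajectory batch" and "alternative advantage estimates" are constructed is not visible in this log; the sketch below covers only the generic critic part, with a hypothetical `critic(input_ids)` returning per-step values and precomputed `returns`:

```python
import torch
import torch.nn.functional as F

def advantages_with_critic_accumulation(critic, batch, num_accum):
    """Advantage estimates A_t = R_t - V(s_t), accumulating the critic loss.

    The critic's regression gradients are backpropagated here but the
    optimizer step is assumed to happen later, after all batches.
    """
    values = critic(batch["input_ids"])                # hypothetical: per-step V(s_t)
    advantages = (batch["returns"] - values).detach()  # stop-grad: no policy grad via V
    critic_loss = F.mse_loss(values, batch["returns"]) / num_accum
    critic_loss.backward()                             # grads accumulate on the critic
    return advantages
```

Detaching the advantages is the important detail: the policy-gradient loss must treat them as constants, while the undetached `values` still carry gradients for the critic's own update.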
[2025-08-20 23:02:28,747][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:02:28,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:02:28,755][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:02:31,211][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:02:31,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:02:31,218][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:02:31,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:02:31,221][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:02:31,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:32,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:33,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:33,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:34,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:35,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:36,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:37,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:37,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:38,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:39,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:40,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:41,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:41,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:42,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:43,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:44,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:45,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:45,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:47,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:47,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:48,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:49,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:50,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:51,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:51,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:52,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:53,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:54,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:54,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:55,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:56,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:02:58,152][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:02:59,090][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:02:59,092][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:03:00,390][__main__][INFO] - Iteration 960 took 54s (37.23% Gen, 62.77% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12m 50s. Estimated total time: 15h 5m 19s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 31s, 500 more iterations: 7h 32m 39s.
[2025-08-20 23:03:00,391][__main__][INFO] - Starting iteration 960.
[2025-08-20 23:03:23,604][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:03:23,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:03:23,611][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:03:26,069][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:03:26,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:03:26,077][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:03:26,079][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:03:26,080][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:03:26,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:27,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:27,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:28,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:29,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:30,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:31,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:31,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:32,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:33,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:34,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:35,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:35,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:36,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:37,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:38,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:39,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:40,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:41,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:42,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:42,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:43,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:44,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:45,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:45,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:46,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:47,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:48,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:49,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:49,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:50,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:51,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:03:53,161][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:03:54,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:03:54,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:03:55,347][__main__][INFO] - Iteration 961 took 54s (37.74% Gen, 62.26% Train). Generation: 20s, Training: 34s. Estimated remaining time: 22m 31s. Estimated total time: 15h 15m 55s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 35s, 500 more iterations: 7h 37m 57s.
[2025-08-20 23:03:55,348][__main__][INFO] - Starting iteration 961.
[2025-08-20 23:04:17,914][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:04:17,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:04:17,922][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:04:20,344][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:04:20,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:04:20,352][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:04:20,355][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:04:20,355][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:04:20,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:21,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:22,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:23,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:23,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:24,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:25,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:26,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:27,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:27,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:28,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:29,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:30,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:30,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:31,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:32,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:33,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:34,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:34,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:35,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:36,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:37,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:38,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:39,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:40,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:41,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:41,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:42,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:43,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:44,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:45,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:45,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:04:47,443][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:04:48,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:04:48,324][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:04:49,607][__main__][INFO] - Iteration 962 took 54s (37.11% Gen, 62.89% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9m 59s. Estimated total time: 15h 4m 18s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 25s, 500 more iterations: 7h 32m 9s.
[2025-08-20 23:04:49,609][__main__][INFO] - Starting iteration 962.
[2025-08-20 23:05:12,612][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:05:12,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:05:12,620][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:05:15,091][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:05:15,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:05:15,098][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:05:15,101][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:05:15,101][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:05:15,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:16,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:16,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:17,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:18,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:19,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:20,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:20,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:21,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:22,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:23,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:24,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:24,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:25,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:26,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:27,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:28,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:29,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:30,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:30,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:31,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:32,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:33,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:34,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:34,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:35,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:36,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:37,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:38,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:38,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:39,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:40,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:05:42,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:05:42,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:05:42,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:05:44,407][__main__][INFO] - Iteration 963 took 54s (37.48% Gen, 62.52% Train). Generation: 20s, Training: 34s. Estimated remaining time: 18m 4s. Estimated total time: 15h 13m 17s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 19s, 500 more iterations: 7h 36m 38s.
[2025-08-20 23:05:44,408][__main__][INFO] - Starting iteration 963.
[2025-08-20 23:06:06,980][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:06:06,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:06:06,988][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:06:09,440][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:06:09,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:06:09,448][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:06:09,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:06:09,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:06:09,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:10,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:11,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:12,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:12,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:13,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:14,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:15,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:16,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:16,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:17,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:18,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:19,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:20,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:20,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:21,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:06:22,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:23,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:24,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:25,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:26,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:26,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:27,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:28,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:29,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:30,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:30,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:31,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:32,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:33,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:34,128][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:34,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:06:36,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:06:37,489][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:06:37,491][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:06:38,727][__main__][INFO] - Iteration 964 took 54s (37.05% Gen, 62.95% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9m 10s. Estimated total time: 15h 5m 18s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 31s, 500 more iterations: 7h 32m 39s. [2025-08-20 23:06:38,728][__main__][INFO] - Starting iteration 964. 
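The "For task: ... ΔVRAM Allocated ... ΔTime ... Percentage of VRAM taken" records are per-task resource deltas: memory and wall-clock measurements taken immediately before and after each named task. A hedged sketch of such a profiling context manager, assuming torch.cuda counters; track_task and the exact percentage definition (reserved memory over device total) are assumptions, not the logger's confirmed implementation:

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def track_task(name):
    """Log VRAM and wall-clock deltas around a named task (illustrative)."""
    alloc0 = torch.cuda.memory_allocated()
    reserved0 = torch.cuda.memory_reserved()
    t0 = time.monotonic()
    yield
    dt = time.monotonic() - t0
    d_alloc = (torch.cuda.memory_allocated() - alloc0) / 2**30     # GB
    d_reserved = (torch.cuda.memory_reserved() - reserved0) / 2**30  # GB
    total = torch.cuda.get_device_properties(0).total_memory
    pct = 100.0 * torch.cuda.memory_reserved() / total  # one plausible definition
    print(f"For task: {name} , ΔVRAM Allocated: {d_alloc} GB, "
          f"ΔVRAM Reserved: {d_reserved} GB, "
          f"ΔTime: {time.strftime('%H:%M:%S', time.gmtime(dt))}, "
          f"Percentage of VRAM taken: {pct}%,")

# e.g.  with track_task("Create alternative trajectory batch"): ...
```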
[2025-08-20 23:07:01,424][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:01,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:01,432][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:03,879][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:03,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:03,887][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:03,890][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:07:03,890][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:07:04,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:04,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:05,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:06,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:07,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:08,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:08,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:09,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:10,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:11,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:12,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:12,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:13,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:14,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:15,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:16,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:07:16,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:17,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:18,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:19,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:20,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:20,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:21,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:22,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:23,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:24,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:25,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:26,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:27,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:27,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:28,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:29,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:31,017][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:07:31,992][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:07:31,993][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:07:33,317][__main__][INFO] - Iteration 965 took 54s (37.10% Gen, 62.90% Train). Generation: 20s, Training: 34s. Estimated remaining time: 12m 46s. Estimated total time: 15h 9m 48s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 58s, 500 more iterations: 7h 34m 54s. [2025-08-20 23:07:33,319][__main__][INFO] - Starting iteration 965. 
[2025-08-20 23:07:56,053][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:56,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:56,061][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:58,500][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:58,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:58,509][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:07:58,511][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:07:58,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:07:58,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:07:59,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:00,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:01,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:01,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:02,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:03,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:04,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:05,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:05,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:06,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:07,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:08,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:09,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:09,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:10,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:08:11,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:12,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:13,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:14,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:15,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:15,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:16,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:17,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:18,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:19,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:19,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:20,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:21,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:22,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:23,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:23,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:25,471][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:08:26,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:08:26,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:08:27,684][__main__][INFO] - Iteration 966 took 54s (37.29% Gen, 62.71% Train). Generation: 20s, Training: 34s. Estimated remaining time: 8m 8s. Estimated total time: 15h 6m 5s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 36s, 500 more iterations: 7h 33m 2s. [2025-08-20 23:08:27,686][__main__][INFO] - Starting iteration 966. 
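The per-iteration summaries project remaining time from the average iteration duration: at roughly 54.3 s per iteration, "10 more iterations: 9m 3s" is simply 10 × 54.3 s, and the 100- and 500-iteration figures scale the same way, up to rounding of the running average. A small sketch of that arithmetic; both helper names are invented:

```python
def format_hms(seconds: float) -> str:
    """Render seconds in the log's '15h 12m 23s' style."""
    s = int(round(seconds))
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if m or h:
        parts.append(f"{m}m")
    parts.append(f"{s}s")
    return " ".join(parts)

def eta_report(avg_iter_seconds: float, iters_left: int) -> str:
    """Project remaining time and the fixed 10/100/500-iteration estimates."""
    return (f"Estimated remaining time: {format_hms(avg_iter_seconds * iters_left)}. "
            + ", ".join(f"{n} more iterations: {format_hms(avg_iter_seconds * n)}"
                        for n in (10, 100, 500)))

# e.g. eta_report(54.3, 10) includes "10 more iterations: 9m 3s"
```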
[2025-08-20 23:08:50,576][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:08:50,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:08:50,583][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:08:53,050][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:08:53,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:08:53,058][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:08:53,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:08:53,061][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:08:53,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:54,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:54,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:55,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:56,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:57,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:58,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:58,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:08:59,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:00,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:01,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:02,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:02,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:03,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:04,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:05,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:09:06,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:06,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:07,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:08,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:09,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:10,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:11,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:12,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:12,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:13,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:14,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:15,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:16,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:16,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:17,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:18,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:20,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:09:21,049][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:09:21,051][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:09:22,452][__main__][INFO] - Iteration 967 took 54s (37.28% Gen, 62.72% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13m 54s. Estimated total time: 15h 12m 46s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 16s, 500 more iterations: 7h 36m 23s. [2025-08-20 23:09:22,454][__main__][INFO] - Starting iteration 967. 
[2025-08-20 23:09:45,523][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:09:45,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:09:45,532][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:09:47,991][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:09:47,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:09:47,999][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:09:48,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:09:48,003][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:09:48,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:49,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:49,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:50,671][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:51,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:52,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:53,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:53,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:54,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:55,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:56,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:57,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:57,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:58,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:09:59,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:00,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:10:00,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:01,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:02,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:03,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:04,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:05,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:06,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:06,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:07,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:08,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:09,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:10,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:10,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:11,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:12,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:13,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:14,896][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:10:15,883][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:10:15,884][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:10:17,235][__main__][INFO] - Iteration 968 took 54s (37.61% Gen, 62.39% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13m 14s. Estimated total time: 15h 13m 0s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 18s, 500 more iterations: 7h 36m 30s. [2025-08-20 23:10:17,236][__main__][INFO] - Starting iteration 968. 
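After every step, both optimizer states are rewritten to the same .pt files so a restart can resume from the latest step. A minimal sketch of that checkpointing, assuming PyTorch's torch.save over optimizer.state_dict(); the temp-file-then-rename safeguard is an added assumption, not something the log confirms:

```python
import os

import torch

def save_optimizer_state(optimizer, path):
    """Persist an optimizer's state_dict, writing to a temp file first so a
    crash mid-save cannot corrupt the previous checkpoint (illustrative)."""
    tmp = path + ".tmp"
    torch.save(optimizer.state_dict(), tmp)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

# e.g. save_optimizer_state(policy_optimizer, "agent_trainer/policy_optimizer_state.pt")
#      save_optimizer_state(critic_optimizer, "agent_trainer/critic_optimizer_state.pt")
```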
[2025-08-20 23:10:40,474][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:10:40,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:10:40,482][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:10:42,924][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:10:42,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:10:42,932][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:10:42,934][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:10:42,934][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:10:43,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:44,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:44,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:45,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:46,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:47,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:47,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:48,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:49,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:50,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:51,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:51,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:52,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:53,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:54,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:55,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:10:55,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:56,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:57,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:58,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:59,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:10:59,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:00,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:01,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:02,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:03,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:04,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:05,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:05,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:06,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:07,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:08,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:09,926][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:11:10,848][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:11:10,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:11:12,119][__main__][INFO] - Iteration 969 took 54s (37.90% Gen, 62.10% Train). Generation: 20s, Training: 34s. Estimated remaining time: 14m 1s. Estimated total time: 15h 14m 42s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 21s. [2025-08-20 23:11:12,120][__main__][INFO] - Starting iteration 969. 
[2025-08-20 23:11:35,161][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:11:35,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:11:35,169][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:11:37,631][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:11:37,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:11:37,638][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:11:37,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:11:37,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:11:37,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:38,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:39,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:40,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:41,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:41,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:42,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:43,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:44,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:45,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:45,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:46,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:47,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:48,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:49,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:49,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:11:50,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:51,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:52,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:53,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:54,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:55,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:55,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:56,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:57,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:58,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:59,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:11:59,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:00,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:01,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:02,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:03,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:04,694][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:12:05,602][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:12:05,603][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:12:06,998][__main__][INFO] - Iteration 970 took 54s (37.52% Gen, 62.48% Train). Generation: 20s, Training: 34s. Estimated remaining time: 13m 1s. Estimated total time: 15h 14m 37s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 18s. [2025-08-20 23:12:06,999][__main__][INFO] - Starting iteration 970. 
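The "Create alternative trajectory batch" / "Compute alternative advantage estimates" tasks, followed by "Sharing advantage alignment data" and "Receiving advantage packets", suggest that the two trainers exchange per-token advantage estimates so each policy gradient can take the opponent's advantages into account. The toy surrogate below combines an agent's own advantages with the received ones via a simple product term; it is a stand-in to illustrate the data flow, not the actual rule implemented in mllm.training.trainer_ad_align:

```python
import torch

def aligned_pg_loss(logprobs, adv_own, adv_received, beta=1.0):
    """Toy advantage-alignment surrogate (illustrative only).

    adv_own:      this agent's per-token advantage estimates
    adv_received: the other agent's advantages from a received packet
    The product term up-weights actions estimated to be advantageous for
    both agents; the real alignment rule may differ.
    """
    weights = adv_own + beta * adv_own * adv_received
    return -(weights.detach() * logprobs).mean()
```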
[2025-08-20 23:12:29,777][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:12:29,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:12:29,784][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:12:32,249][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:12:32,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:12:32,257][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:12:32,259][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:12:32,260][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:12:32,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:33,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:34,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:34,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:35,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:36,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:37,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:38,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:38,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:39,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:40,494][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:41,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:42,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:43,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:44,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:44,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:12:45,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:46,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:47,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:48,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:48,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:49,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:50,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:51,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:52,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:52,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:53,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:54,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:55,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:56,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:56,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:57,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:12:59,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:13:00,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:13:00,165][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:13:01,579][__main__][INFO] - Iteration 971 took 54s (37.25% Gen, 62.75% Train). Generation: 20s, Training: 34s. Estimated remaining time: 7m 8s. Estimated total time: 15h 9m 39s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 57s, 500 more iterations: 7h 34m 49s. [2025-08-20 23:13:01,581][__main__][INFO] - Starting iteration 971. 
[2025-08-20 23:13:25,308][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:13:25,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:13:25,316][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:13:27,808][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:13:27,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:13:27,816][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:13:27,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:13:27,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:13:28,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:28,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:29,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:30,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:31,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:32,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:32,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:33,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:34,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:35,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:36,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:36,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:37,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:38,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:39,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:40,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. 
[2025-08-20 23:13:40,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:41,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:42,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:43,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:44,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:45,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:46,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:46,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:47,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:48,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:49,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:50,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:50,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:51,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:52,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:53,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8950 tokens. [2025-08-20 23:13:54,790][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:13:55,771][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:13:55,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:13:57,135][__main__][INFO] - Iteration 972 took 55s (38.20% Gen, 61.79% Train). Generation: 21s, Training: 34s. Estimated remaining time: 22m 27s. Estimated total time: 15h 25m 53s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 35s, 500 more iterations: 7h 42m 56s. [2025-08-20 23:13:57,136][__main__][INFO] - Starting iteration 972. 
[2025-08-20 23:14:19,781][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:14:19,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:14:19,789][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:14:22,259][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:14:22,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:14:22,267][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:14:22,269][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:14:22,270][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:14:22,569 -> 23:14:47,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, one per gradient-accumulation micro-batch, ~0.8 s apart)
[2025-08-20 23:14:49,308][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 23:14:50,231][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:14:50,232][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:14:51,464][__main__][INFO] - Iteration 973 took 54s (37.15% Gen, 62.84% Train). Generation: 20s, Training: 34s. Estimated remaining time: 1m 7s. Estimated total time: 15h 5m 27s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 32s, 500 more iterations: 7h 32m 43s.
[2025-08-20 23:14:51,466][__main__][INFO] - Starting iteration 973.
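
Each training phase accumulates the policy-gradient loss over 32 micro-batches of roughly 8,960 trainable tokens before the single optimizer update logged as "Apply reinforce step". A sketch of that accumulation pattern, assuming per-token advantages and a mask over trainable tokens (the function name and batch fields are illustrative, not the mllm API):

    def reinforce_update(compute_token_logps, optimizer, micro_batches, logger):
        """One accumulated REINFORCE update (sketch, not the mllm code).

        compute_token_logps(batch) is assumed to return per-token log-probs
        of the sampled tokens, shaped like batch["advantages"] and
        batch["mask"] (1.0 on trainable tokens, 0.0 elsewhere).
        """
        optimizer.zero_grad(set_to_none=True)
        n = len(micro_batches)                     # 32 in this run
        for batch in micro_batches:
            logps = compute_token_logps(batch)
            mask = batch["mask"]
            # REINFORCE surrogate: maximize E[advantage * log pi], i.e.
            # minimize its negation, averaged over trainable tokens.
            loss = -(batch["advantages"] * logps * mask).sum() / mask.sum()
            (loss / n).backward()                  # gradients accumulate
            logger.info(
                f"Accumulated the policy gradient loss for {int(mask.sum())} tokens."
            )
        optimizer.step()                           # the 'Apply reinforce step'

Scaling each micro-batch loss by 1/n keeps the accumulated gradient equal to the gradient of the mean loss over the full batch, while peak activation memory stays at one micro-batch.
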
[2025-08-20 23:15:14,660][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:15:14,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:15:14,668][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:15:17,128][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:15:17,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:15:17,135][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:15:17,138][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:15:17,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:15:17,437 -> 23:15:42,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, one per gradient-accumulation micro-batch, ~0.8 s apart)
[2025-08-20 23:15:44,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 23:15:45,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:15:45,061][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:15:46,356][__main__][INFO] - Iteration 974 took 54s (37.77% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 9m 34s. Estimated total time: 15h 14m 49s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 24s.
[2025-08-20 23:15:46,357][__main__][INFO] - Starting iteration 974.
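
The paired "Sharing advantage alignment data" / "Receiving advantage packets" lines indicate that the two agents' trainers swap advantage estimates, so each can weight its own policy gradient against its opponent's advantages. The log shows neither the packet layout nor the transport; a minimal in-process sketch, with all names hypothetical:

    from dataclasses import dataclass

    import torch

    @dataclass
    class AdvantagePacket:
        """Hypothetical container; the real packet layout is not in the log."""
        agent_id: str
        advantages: torch.Tensor  # per-turn advantage estimates for this agent

    class AdvantageExchange:
        """In-process mailbox two trainers use to swap advantage packets."""
        def __init__(self):
            self._box = {}

        def share(self, packet: AdvantagePacket) -> None:
            self._box[packet.agent_id] = packet

        def receive(self, own_id: str) -> AdvantagePacket:
            # Exactly two agents assumed: return the opponent's packet.
            (other,) = [p for aid, p in self._box.items() if aid != own_id]
            return other

In a multi-process setup the same exchange would typically ride on torch.distributed point-to-point send/recv rather than a shared mailbox.
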
[2025-08-20 23:16:10,070][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:16:10,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:16:10,078][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:16:12,559][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:16:12,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:16:12,568][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:16:12,570][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:16:12,570][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:16:12,869 -> 23:16:37,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, one per gradient-accumulation micro-batch, ~0.8 s apart)
[2025-08-20 23:16:39,540][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 23:16:40,421][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:16:40,423][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:16:41,846][__main__][INFO] - Iteration 975 took 55s (38.27% Gen, 61.73% Train). Generation: 21s, Training: 34s. Estimated remaining time: 18m 37s. Estimated total time: 15h 24m 48s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 28s, 500 more iterations: 7h 42m 24s.
[2025-08-20 23:16:41,847][__main__][INFO] - Starting iteration 975.
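
After every reinforce step, both optimizer states are written out so a preempted job can resume without losing the Adam moment estimates. A sketch of that checkpointing; the atomic temp-file rename is a common safety measure I am assuming, not something the log confirms:

    import os

    import torch

    def save_optimizer_state(optimizer, path):
        """Write optimizer.state_dict() without risking a half-written file."""
        tmp = path + ".tmp"
        torch.save(optimizer.state_dict(), tmp)
        os.replace(tmp, path)  # atomic rename on POSIX filesystems

Saving every iteration is cheap here (the writes take under a second in the log) because only adapter parameters, not the full 7B base model, carry optimizer state.
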
[2025-08-20 23:17:04,564][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:17:04,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:17:04,571][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:17:07,031][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:17:07,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:17:07,039][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:17:07,041][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:17:07,042][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:17:07,340 -> 23:17:32,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, one per gradient-accumulation micro-batch, ~0.8 s apart)
[2025-08-20 23:17:34,096][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 23:17:35,337][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:17:35,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:17:36,606][__main__][INFO] - Iteration 976 took 54s (36.99% Gen, 63.01% Train). Generation: 20s, Training: 34s. Estimated remaining time: 5m 33s. Estimated total time: 15h 12m 38s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 19s.
[2025-08-20 23:17:36,608][__main__][INFO] - Starting iteration 976.
[2025-08-20 23:17:59,422][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:17:59,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:17:59,430][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:01,885][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:01,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:01,893][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:01,895][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:18:01,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:18:02,193 -> 23:18:27,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, one per gradient-accumulation micro-batch, ~0.8 s apart)
[2025-08-20 23:18:28,890][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 23:18:29,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:18:29,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:18:31,071][__main__][INFO] - Iteration 977 took 54s (37.40% Gen, 62.59% Train). Generation: 20s, Training: 34s. Estimated remaining time: -17s. Estimated total time: 15h 7m 42s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 46s, 500 more iterations: 7h 33m 51s.
[2025-08-20 23:18:31,074][__main__][INFO] - Starting iteration 977.
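
The occasional negative "Estimated remaining time" (-17s above, -253s at iteration 980) is consistent with an estimator that extrapolates total runtime from recent iteration durations and subtracts elapsed time: when recent iterations run faster than the historical average, the extrapolated total can fall below the time already spent. One plausible scheme, sketched below; the actual estimator is not shown in the log:

    import time
    from collections import deque

    class EtaTracker:
        """Sketch of a remaining-time estimator consistent with this log."""
        def __init__(self, total_iters, window=10):
            self.total_iters = total_iters
            self.start = time.monotonic()
            self.durations = deque(maxlen=window)

        def update(self, iter_seconds):
            self.durations.append(iter_seconds)
            recent_mean = sum(self.durations) / len(self.durations)
            est_total = recent_mean * self.total_iters
            remaining = est_total - (time.monotonic() - self.start)
            # When recent iterations beat the historical average, est_total
            # can drop below elapsed time, so remaining goes negative.
            return est_total, remaining

This also explains why "Estimated total time" wobbles between roughly 15h 5m and 15h 26m from one iteration to the next: a 1 s swing in iteration time, multiplied across the full run, moves the extrapolated total by about 17 minutes.
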
[2025-08-20 23:18:54,072][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:54,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:54,080][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:56,538][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:56,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:56,546][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:18:56,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:18:56,549][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:18:56,849 -> 23:19:21,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, one per gradient-accumulation micro-batch, ~0.8 s apart)
[2025-08-20 23:19:23,586][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 23:19:24,512][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:19:24,514][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:19:25,820][__main__][INFO] - Iteration 978 took 54s (37.54% Gen, 62.46% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3m 30s. Estimated total time: 15h 12m 25s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 14s, 500 more iterations: 7h 36m 12s.
[2025-08-20 23:19:25,821][__main__][INFO] - Starting iteration 978.
[2025-08-20 23:19:49,002][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:19:49,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:19:49,010][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:19:51,487][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:19:51,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:19:51,495][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:19:51,497][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:19:51,498][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:19:51,796 -> 23:20:16,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, one per gradient-accumulation micro-batch, ~0.8 s apart)
[2025-08-20 23:20:18,545][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 23:20:19,472][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:20:19,474][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:20:20,705][__main__][INFO] - Iteration 979 took 54s (37.76% Gen, 62.23% Train). Generation: 20s, Training: 34s. Estimated remaining time: 4m 54s. Estimated total time: 15h 14m 43s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 28s, 500 more iterations: 7h 37m 21s.
[2025-08-20 23:20:20,707][__main__][INFO] - Starting iteration 979.
[2025-08-20 23:20:43,412][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:20:43,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:20:43,420][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:20:45,887][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:20:45,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:20:45,895][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:20:45,897][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:20:45,898][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:20:46,196 -> 23:21:11,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, one per gradient-accumulation micro-batch, ~0.8 s apart)
[2025-08-20 23:21:12,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 23:21:13,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:21:13,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:21:15,098][__main__][INFO] - Iteration 980 took 54s (37.22% Gen, 62.78% Train). Generation: 20s, Training: 34s. Estimated remaining time: -253s. Estimated total time: 15h 6m 30s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 39s, 500 more iterations: 7h 33m 15s.
[2025-08-20 23:21:15,099][__main__][INFO] - Starting iteration 980.
[2025-08-20 23:21:38,335][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:21:38,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:21:38,343][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:21:40,800][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:21:40,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:21:40,808][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates, ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%
[2025-08-20 23:21:40,810][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:21:40,811][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:21:41,110 -> 23:22:06,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. (32 consecutive entries, one per gradient-accumulation micro-batch, ~0.8 s apart)
[2025-08-20 23:22:07,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%
[2025-08-20 23:22:08,729][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:22:08,730][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:22:10,026][__main__][INFO] - Iteration 981 took 54s (37.85% Gen, 62.15% Train). Generation: 20s, Training: 34s. Estimated remaining time: 3m 47s. Estimated total time: 15h 15m 26s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 43s.
[2025-08-20 23:22:10,028][__main__][INFO] - Starting iteration 981.
[2025-08-20 23:22:33,218][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:22:33,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:22:33,226][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:22:35,678][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:22:35,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:22:35,686][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:22:35,689][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:22:35,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:22:35,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:36,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:37,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:38,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:39,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:39,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:40,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:41,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:42,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:43,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:43,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:44,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:45,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:46,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:47,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:47,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:48,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:49,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:50,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:51,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:51,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:52,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:53,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:54,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:55,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:56,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:57,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:58,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:58,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:22:59,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:00,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:01,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:02,786][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:23:03,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:23:03,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:23:04,924][__main__][INFO] - Iteration 982 took 54s (37.80% Gen, 62.20% Train). Generation: 20s, Training: 34s. Estimated remaining time: 2m 21s. Estimated total time: 15h 14m 55s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 29s, 500 more iterations: 7h 37m 27s.
[2025-08-20 23:23:04,925][__main__][INFO] - Starting iteration 982.
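Editor's note: each iteration logs 32 "Accumulated the policy gradient loss for 8960 tokens" records followed by a single "Apply reinforce step", which matches micro-batched gradient accumulation: backpropagate each chunk's policy-gradient loss, then take one optimizer step. A minimal sketch under that assumption follows; the function name, batch keys, and normalization choice are illustrative, not the trainer's actual API.

```python
import torch

def accumulate_and_step(policy, optimizer, micro_batches):
    """Sketch: REINFORCE with gradient accumulation. Each micro-batch
    would correspond to one 'Accumulated the policy gradient loss'
    record; optimizer.step() to the 'Apply reinforce step' record."""
    optimizer.zero_grad(set_to_none=True)
    n = len(micro_batches)  # e.g. 32 chunks of 8960 tokens each
    for batch in micro_batches:
        logits = policy(batch["input_ids"]).logits
        logp = torch.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, batch["labels"].unsqueeze(-1)).squeeze(-1)
        # Advantage-weighted negative log-likelihood, restricted to
        # action tokens via the mask (prompt/padding excluded).
        loss = -(batch["advantages"] * tok_logp * batch["mask"]).sum()
        (loss / n).backward()  # accumulate gradients across micro-batches
    optimizer.step()
```

The tiny ΔVRAM of the "Apply reinforce step" task (about 24 KB allocated, zero reserved) is consistent with this pattern: by step time all activations are freed and only optimizer bookkeeping changes.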
[2025-08-20 23:23:27,720][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:23:27,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:23:27,728][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:23:30,203][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:23:30,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:23:30,211][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:23:30,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:23:30,214][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:23:30,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:31,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:32,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:32,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:33,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:34,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:35,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:36,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:36,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:37,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:38,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:39,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:40,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:40,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:41,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:42,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:43,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:44,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:44,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:45,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:46,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:47,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:48,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:49,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:50,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:50,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:51,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:52,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:53,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:54,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:54,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:55,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:23:57,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:23:58,121][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:23:58,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:23:59,461][__main__][INFO] - Iteration 983 took 54s (37.28% Gen, 62.72% Train). Generation: 20s, Training: 34s. Estimated remaining time: -273s. Estimated total time: 15h 8m 54s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 53s, 500 more iterations: 7h 34m 27s.
[2025-08-20 23:23:59,462][__main__][INFO] - Starting iteration 983.
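Editor's note: from iteration 983 onward the summary line starts printing negative values such as "Estimated remaining time: -273s". This is consistent with an estimator that computes remaining = projected_total - elapsed, where projected_total is extrapolated from a per-iteration average: near the end of the run, jitter in the average can push the projection below the elapsed wall-clock time, and the difference goes negative. A sketch of that arithmetic and the clamp that would avoid it is below; class and method names are hypothetical.

```python
import time

class EtaEstimator:
    """Sketch of the ETA arithmetic implied by the iteration reports.
    The remaining-time figure can go negative near the end of the run
    unless it is clamped."""

    def __init__(self, total_iters: int):
        self.total_iters = total_iters
        self.start = time.monotonic()

    def remaining_seconds(self, iters_done: int) -> float:
        elapsed = time.monotonic() - self.start
        avg = elapsed / max(iters_done, 1)
        projected_total = avg * self.total_iters
        remaining = projected_total - elapsed
        # Without the clamp, values like '-273s' are printed whenever
        # projected_total dips below the elapsed wall-clock time.
        return max(remaining, 0.0)
```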
[2025-08-20 23:24:22,073][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:24:22,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:24:22,081][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:24:24,542][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:24:24,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:24:24,549][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:24:24,551][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:24:24,552][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:24:24,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:25,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:26,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:27,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:28,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:28,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:29,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:30,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:31,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:31,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:32,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:33,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:34,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:35,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:36,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:37,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:38,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:38,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:39,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:40,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:41,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:41,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:42,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:43,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:44,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:45,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:45,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:46,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:47,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:48,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:49,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:49,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:24:51,494][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:24:52,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:24:52,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:24:53,698][__main__][INFO] - Iteration 984 took 54s (37.19% Gen, 62.81% Train). Generation: 20s, Training: 34s. Estimated remaining time: -627s. Estimated total time: 15h 3m 55s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 23s, 500 more iterations: 7h 31m 57s.
[2025-08-20 23:24:53,699][__main__][INFO] - Starting iteration 984.
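Editor's note: the paired "Sharing advantage alignment data." / "Receiving advantage packets." records indicate that the two co-trained agents exchange advantage estimates before the policy update, which advantage alignment needs in order to weight each agent's update by the opponent's advantages as well as its own. The transport is not visible in the log; the sketch below uses a plain in-process queue pair purely for illustration, and the class and method names are assumptions.

```python
import queue

class AdvantageExchange:
    """Hypothetical in-process channel for swapping per-step advantage
    tensors between two co-trained agents."""

    def __init__(self):
        self.to_peer: queue.Queue = queue.Queue()
        self.from_peer: queue.Queue = queue.Queue()

    def share(self, advantages):
        # Counterpart of 'Sharing advantage alignment data.'
        self.to_peer.put(advantages)

    def receive(self, timeout: float = 60.0):
        # Counterpart of 'Receiving advantage packets.'
        return self.from_peer.get(timeout=timeout)
```

Once received, the opponent's advantages can be combined with the agent's own (for example through a product term in the alignment objective); the exact combination depends on the trainer and is not recoverable from the log.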
[2025-08-20 23:25:16,378][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:25:16,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:25:16,386][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:25:18,878][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:25:18,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:25:18,885][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:25:18,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:25:18,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:25:19,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:19,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:20,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:21,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:22,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:23,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:23,944][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:24,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:25,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:26,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:27,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:27,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:28,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:29,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:30,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:31,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:31,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:32,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:33,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:34,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:35,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:36,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:37,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:37,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:38,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:39,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:40,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:41,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:41,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:42,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:43,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:44,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:25:45,943][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:25:46,870][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:25:46,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:25:48,095][__main__][INFO] - Iteration 985 took 54s (37.16% Gen, 62.84% Train). Generation: 20s, Training: 34s. Estimated remaining time: -521s. Estimated total time: 15h 6m 35s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 39s, 500 more iterations: 7h 33m 17s.
[2025-08-20 23:25:48,096][__main__][INFO] - Starting iteration 985.
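Editor's note: every iteration ends by persisting both optimizers' states to `policy_optimizer_state.pt` and `critic_optimizer_state.pt`, so a preempted job can resume with intact optimizer moments instead of rebuilding them from scratch. A minimal sketch of the save/restore pair using standard PyTorch serialization; the function names are illustrative, while the file names follow the log.

```python
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    """Persist both optimizers, mirroring the 'Saved main/critic
    optimizer state' records above."""
    torch.save(policy_opt.state_dict(), f"{out_dir}/policy_optimizer_state.pt")
    torch.save(critic_opt.state_dict(), f"{out_dir}/critic_optimizer_state.pt")

def restore_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    """Reload the persisted states when resuming a preempted run."""
    policy_opt.load_state_dict(
        torch.load(f"{out_dir}/policy_optimizer_state.pt", map_location="cpu"))
    critic_opt.load_state_dict(
        torch.load(f"{out_dir}/critic_optimizer_state.pt", map_location="cpu"))
```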
[2025-08-20 23:26:10,842][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:26:10,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:26:10,850][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:26:13,326][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:26:13,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:26:13,334][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:26:13,337][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:26:13,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:26:13,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:14,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:15,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:16,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:16,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:17,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:18,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:19,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:19,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:20,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:21,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:22,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:23,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:23,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:24,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:25,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:26,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:27,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:28,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:29,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:29,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:30,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:31,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:32,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:33,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:33,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:34,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:35,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:36,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:37,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:37,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:38,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:26:40,222][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:26:41,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:26:41,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:26:42,459][__main__][INFO] - Iteration 986 took 54s (37.33% Gen, 62.67% Train). Generation: 20s, Training: 34s. Estimated remaining time: -608s. Estimated total time: 15h 6m 2s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 36s, 500 more iterations: 7h 33m 1s.
[2025-08-20 23:26:42,460][__main__][INFO] - Starting iteration 986.
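Editor's note: each training phase opens by building an "alternative trajectory batch" and computing "alternative advantage estimates", twice per iteration, consistent with one pass per agent adapter. This fits advantage alignment's need for advantages over counterfactual (alternative) actions, not only the sampled ones, but the exact construction is not recoverable from the log. The sketch below shows one plausible reading: one-step TD advantages A = r + γV(s') - V(s) computed by the critic over a batch of alternative trajectories. Everything here, from the function name to the batch keys, is an assumption.

```python
import torch

@torch.no_grad()
def alternative_advantages(critic, batch, gamma: float = 0.99):
    """Sketch: one-step TD advantages over an 'alternative' trajectory
    batch (counterfactual actions re-scored by the critic)."""
    v_s = critic(batch["states"]).squeeze(-1)          # V(s_t)
    v_next = critic(batch["next_states"]).squeeze(-1)  # V(s_{t+1})
    not_done = 1.0 - batch["dones"].float()
    return batch["rewards"] + gamma * v_next * not_done - v_s
```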
[2025-08-20 23:27:05,714][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:27:05,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:27:05,722][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:27:08,172][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:27:08,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:27:08,180][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:27:08,182][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:27:08,182][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:27:08,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:09,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:10,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:10,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:11,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:12,445][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:13,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:14,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:14,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:15,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:16,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:17,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:17,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:18,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:19,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:20,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:21,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:21,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:23,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:24,024][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:24,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:25,612][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:26,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:27,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:27,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:28,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:29,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:30,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:31,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:31,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:32,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:33,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:27:35,095][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:27:36,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:27:36,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:27:37,289][__main__][INFO] - Iteration 987 took 54s (37.96% Gen, 62.04% Train). Generation: 20s, Training: 34s. Estimated remaining time: -197s. Estimated total time: 15h 13m 48s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 22s, 500 more iterations: 7h 36m 54s.
[2025-08-20 23:27:37,291][__main__][INFO] - Starting iteration 987.
[2025-08-20 23:28:00,411][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:00,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:00,419][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:02,873][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:02,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:02,881][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:02,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:28:02,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:28:03,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:03,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:04,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:05,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:06,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:07,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:07,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:08,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:09,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:10,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:11,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:11,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:12,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:13,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:14,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:15,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:15,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:16,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:17,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:18,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:19,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:19,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:21,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:22,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:22,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:23,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:24,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:25,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:25,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:26,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:27,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:28,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:30,022][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:28:31,216][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:28:31,218][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:28:32,503][__main__][INFO] - Iteration 988 took 55s (37.45% Gen, 62.55% Train). Generation: 20s, Training: 34s. Estimated remaining time: 2m 9s. Estimated total time: 15h 20m 11s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 1s, 500 more iterations: 7h 40m 5s.
[2025-08-20 23:28:32,504][__main__][INFO] - Starting iteration 988.
[2025-08-20 23:28:55,554][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:55,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:55,561][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:58,051][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:58,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:58,059][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:28:58,062][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:28:58,062][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:28:58,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:59,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:28:59,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:00,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:01,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:02,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:03,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:03,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:04,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:05,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:06,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:07,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:07,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:08,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:09,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:10,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:11,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:11,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:12,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:13,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:14,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:15,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:16,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:17,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:17,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:18,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:19,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:20,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:21,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:21,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:22,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:23,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:25,050][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:29:25,924][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:29:25,925][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:29:27,262][__main__][INFO] - Iteration 989 took 54s (37.61% Gen, 62.39% Train). Generation: 20s, Training: 34s. Estimated remaining time: -378s. Estimated total time: 15h 12m 37s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 15s, 500 more iterations: 7h 36m 18s.
[2025-08-20 23:29:27,263][__main__][INFO] - Starting iteration 989.
[2025-08-20 23:29:50,027][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:29:50,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:29:50,034][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:29:52,489][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:29:52,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:29:52,497][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:29:52,500][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:29:52,500][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:29:52,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:53,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:54,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:55,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:55,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:56,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:57,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:58,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:59,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:29:59,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:00,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:01,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:02,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:03,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:03,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:04,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:05,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:06,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:07,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:08,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:09,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:10,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:10,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:11,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:12,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:13,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:14,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:14,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:15,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:16,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:17,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:18,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:30:19,634][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:30:20,595][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:30:20,596][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:30:21,963][__main__][INFO] - Iteration 990 took 54s (37.13% Gen, 62.87% Train). Generation: 20s, Training: 34s. Estimated remaining time: -491s. Estimated total time: 15h 11m 39s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 9s, 500 more iterations: 7h 35m 49s.
[2025-08-20 23:30:21,965][__main__][INFO] - Starting iteration 990.
[2025-08-20 23:30:44,667][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:30:44,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:30:44,675][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:30:47,156][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:30:47,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:30:47,163][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:30:47,166][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:30:47,166][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:30:47,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:48,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:49,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:49,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:50,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:51,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:52,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:53,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:53,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:54,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:55,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:56,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:56,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:57,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:58,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:30:59,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:31:00,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:00,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:01,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:02,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:03,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:04,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:05,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:06,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:07,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:07,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:08,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:09,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:10,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:11,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:11,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:12,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:14,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:31:15,219][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:31:15,220][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:31:16,617][__main__][INFO] - Iteration 991 took 54s (37.04% Gen, 62.95% Train). Generation: 20s, Training: 34s. Estimated remaining time: -593s. Estimated total time: 15h 10m 51s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 5s, 500 more iterations: 7h 35m 25s. [2025-08-20 23:31:16,618][__main__][INFO] - Starting iteration 991. 
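Each iteration then accumulates the policy-gradient loss over 32 microbatches of 8,960 tokens apiece (the repeated "Accumulated the policy gradient loss" lines), i.e. 32 × 8,960 = 286,720 tokens per reinforce step. The trainer's own code is not in the log; the sketch below shows the standard accumulation pattern, assuming an HF-style causal LM whose output exposes .logits and a microbatch dict with hypothetical keys input_ids, labels, mask, and advantages.

```python
import torch

def accumulate_policy_gradient(model, microbatches, optimizer, logger):
    """REINFORCE-style update accumulated over fixed-size token microbatches.

    Microbatch structure is assumed, not taken from the log: `mask` selects
    generated (action) tokens, `advantages` holds per-token advantage weights.
    """
    optimizer.zero_grad(set_to_none=True)
    # Normalize by the global token count so the summed gradients equal the
    # per-token mean over the whole update, not per microbatch.
    total_tokens = sum(int(mb["mask"].sum()) for mb in microbatches)
    for mb in microbatches:
        logits = model(mb["input_ids"]).logits                # (B, T, V)
        logprobs = torch.log_softmax(logits, dim=-1)
        token_logprobs = logprobs.gather(
            -1, mb["labels"].unsqueeze(-1)).squeeze(-1)       # (B, T)
        # Policy-gradient surrogate: -A_t * log pi(a_t | s_t), masked to actions.
        loss = -(mb["advantages"] * token_logprobs * mb["mask"]).sum() / total_tokens
        loss.backward()   # gradients accumulate across microbatches
        logger.info("Accumulated the policy gradient loss for %d tokens.",
                    int(mb["mask"].sum()))
    optimizer.step()      # the single "Apply reinforce step" at the end
```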
[2025-08-20 23:31:39,797][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:31:39,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:31:39,804][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:31:42,281][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:31:42,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:31:42,289][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:31:42,291][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:31:42,292][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:31:42,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:43,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:44,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:44,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:45,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:46,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:47,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:48,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:48,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:49,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:50,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:51,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:52,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:52,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:53,693][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:54,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:31:55,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:56,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:57,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:58,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:58,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:31:59,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:00,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:01,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:02,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:02,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:03,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:04,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:05,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:06,005][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:06,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:07,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:09,178][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:32:10,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:32:10,120][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:32:11,477][__main__][INFO] - Iteration 992 took 54s (37.73% Gen, 62.27% Train). Generation: 20s, Training: 34s. Estimated remaining time: -442s. Estimated total time: 15h 14m 18s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 25s, 500 more iterations: 7h 37m 9s. [2025-08-20 23:32:11,478][__main__][INFO] - Starting iteration 992. 
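After every "Apply reinforce step" the run persists both optimizer states (the two "Saved … optimizer state" lines above), so a preempted job can resume with its optimizer moments intact. A minimal sketch of that checkpointing, with the out_dir argument standing in for the seed_1000/agent_trainer directory in the log:

```python
import os
import torch

def save_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    """Persist optimizer state after each reinforce step (paths mirror the log)."""
    os.makedirs(out_dir, exist_ok=True)
    torch.save(policy_opt.state_dict(), os.path.join(out_dir, "policy_optimizer_state.pt"))
    torch.save(critic_opt.state_dict(), os.path.join(out_dir, "critic_optimizer_state.pt"))

def load_optimizer_states(policy_opt, critic_opt, out_dir: str) -> None:
    """Restore both optimizers when resuming a run."""
    policy_opt.load_state_dict(torch.load(os.path.join(out_dir, "policy_optimizer_state.pt")))
    critic_opt.load_state_dict(torch.load(os.path.join(out_dir, "critic_optimizer_state.pt")))
```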
[2025-08-20 23:32:35,447][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:32:35,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:32:35,455][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:32:37,919][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:32:37,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:32:37,927][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:32:37,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:32:37,930][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:32:38,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:39,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:39,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:40,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:41,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:42,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:42,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:43,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:44,574][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:45,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:46,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:46,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:47,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:48,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:49,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:50,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:32:50,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:51,720][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:52,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:53,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:54,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:55,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:56,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:57,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:57,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:58,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:32:59,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:00,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:01,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:01,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:02,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:03,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:05,014][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:33:05,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:33:05,920][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:33:07,233][__main__][INFO] - Iteration 993 took 55s (38.56% Gen, 61.43% Train). Generation: 21s, Training: 34s. Estimated remaining time: 6m 37s. Estimated total time: 15h 29m 13s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 36s. [2025-08-20 23:33:07,234][__main__][INFO] - Starting iteration 993. 
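The "Get advantages with critic gradient accumulation" task suggests the critic's value loss is backpropagated in the same pass that produces the advantages for the policy update, so critic gradients build up across trajectories just like the policy gradients do. The exact estimator is not recoverable from the log; the sketch below uses plain discounted returns minus the critic's value as a stand-in (the gamma value and the absence of GAE smoothing are assumptions).

```python
import torch
import torch.nn.functional as F

def advantages_with_critic_accumulation(values: torch.Tensor,
                                        rewards: torch.Tensor,
                                        gamma: float = 0.99) -> torch.Tensor:
    """Return per-step advantages for one trajectory while accumulating the
    critic's value-loss gradient. values: (T,) critic outputs with grad;
    rewards: (T,) per-step rewards."""
    T = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    running = torch.tensor(0.0)
    for t in range(T - 1, -1, -1):          # discounted reward-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = (returns - values).detach()  # no policy grad through the critic
    critic_loss = F.mse_loss(values, returns)
    critic_loss.backward()                    # accumulates into critic grads
    return advantages
```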
[2025-08-20 23:33:30,439][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:33:30,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:33:30,447][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:33:32,916][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:33:32,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:33:32,924][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:33:32,926][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:33:32,927][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:33:33,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:34,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:34,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:35,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:36,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:37,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:37,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:38,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:39,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:40,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:41,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:41,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:42,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:43,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:44,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:45,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:33:45,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:46,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:47,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:48,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:49,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:50,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:51,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:52,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:52,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:53,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:54,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:55,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:56,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:56,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:57,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:33:58,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:00,056][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:34:01,000][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:34:01,002][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:34:02,242][__main__][INFO] - Iteration 994 took 55s (37.70% Gen, 62.30% Train). Generation: 20s, Training: 34s. Estimated remaining time: -403s. Estimated total time: 15h 16m 47s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 23s. [2025-08-20 23:34:02,244][__main__][INFO] - Starting iteration 994. 
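Several of the iteration summaries above report a negative "Estimated remaining time" (for example -403s just above) while still quoting about 9 minutes for 10 more iterations. One plausible reading, assuming the estimator computes remaining = extrapolated_total - wall_clock_elapsed, is that the total is extrapolated from per-iteration durations alone while the elapsed clock also contains one-off startup overhead, so the difference dips below zero in the final iterations; a slower iteration (as in the 6m 37s summary earlier) momentarily inflates the extrapolated total and flips the sign back. The sketch below reproduces that behaviour; the smoothing in the real estimator is not visible in the log.

```python
import time

class EtaEstimator:
    """Per-iteration ETA in the style of the log's summary lines (a sketch)."""

    def __init__(self, total_iters: int):
        self.total_iters = total_iters
        self.start = time.monotonic()
        self.last = self.start

    def update(self) -> dict:
        now = time.monotonic()
        iter_time = now - self.last          # duration of the iteration just finished
        self.last = now
        total_est = iter_time * self.total_iters
        # Elapsed includes startup overhead the extrapolation does not,
        # so this can go slightly negative near the end of the run.
        remaining = total_est - (now - self.start)
        return {
            "iter_s": iter_time,
            "total_est_s": total_est,
            "remaining_s": remaining,
            "per_n_s": {n: n * iter_time for n in (10, 100, 500)},
        }
```

With the ~54.6s iterations seen here, per_n_s[10] is about 546s, matching the "9m 6s" style projections in the summaries.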
[2025-08-20 23:34:24,974][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:34:24,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:34:24,982][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:34:27,424][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:34:27,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:34:27,432][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:34:27,434][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:34:27,435][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:34:27,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:28,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:29,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:30,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:30,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:31,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:32,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:33,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:34,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:34,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:35,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:36,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:37,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:38,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:38,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:40,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:34:40,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:41,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:42,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:43,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:44,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:44,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:45,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:46,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:47,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:47,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:48,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:49,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:50,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:51,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:51,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:52,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:34:54,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:34:55,302][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:34:55,304][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:34:56,700][__main__][INFO] - Iteration 995 took 54s (37.23% Gen, 62.77% Train). Generation: 20s, Training: 34s. Estimated remaining time: -1009s. Estimated total time: 15h 7m 35s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 45s, 500 more iterations: 7h 33m 47s. [2025-08-20 23:34:56,702][__main__][INFO] - Starting iteration 995. 
[2025-08-20 23:35:19,544][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:35:19,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:35:19,552][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:35:22,008][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:35:22,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:35:22,016][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:35:22,018][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:35:22,019][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:35:22,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:23,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:23,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:24,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:25,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:26,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:27,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:27,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:28,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:29,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:30,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:31,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:31,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:32,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:33,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:34,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:35:35,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:36,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:37,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:37,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:38,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:39,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:40,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:41,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:41,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:42,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:43,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:44,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:45,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:45,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:46,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:47,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:35:49,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:35:50,070][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:35:50,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:35:51,355][__main__][INFO] - Iteration 996 took 54s (37.28% Gen, 62.72% Train). Generation: 20s, Training: 34s. Estimated remaining time: -867s. Estimated total time: 15h 10m 52s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 5s, 500 more iterations: 7h 35m 26s. [2025-08-20 23:35:51,356][__main__][INFO] - Starting iteration 996. 
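The paired "Sharing advantage alignment data." and "Receiving advantage packets." messages indicate that the two agent trainers swap advantage tensors each iteration before the policy step. The transport is not shown in the log; a minimal in-process stand-in using standard-library queues (class and method names hypothetical) could look like this:

```python
import queue

import torch

class AdvantageExchange:
    """Stand-in for the share/receive step: each trainer posts its
    per-trajectory advantages and blocks until the opponent's arrive.
    Hypothetical transport; the log only shows that sharing happens."""

    def __init__(self):
        self._queues = {0: queue.Queue(), 1: queue.Queue()}

    def exchange(self, agent_id: int, advantages: torch.Tensor) -> torch.Tensor:
        other = 1 - agent_id
        self._queues[other].put(advantages.detach().cpu())  # share
        return self._queues[agent_id].get()                 # receive packet
```

This works when the two trainers run as threads sharing one AdvantageExchange; a multi-process setup would substitute a pipe or collective communication for the queues.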
[2025-08-20 23:36:14,781][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:36:14,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:36:14,788][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:36:17,242][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:36:17,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:36:17,250][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:36:17,253][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:36:17,253][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:36:17,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:18,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:19,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:19,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:20,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:21,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:22,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:23,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:23,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:24,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:25,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:26,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:27,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:27,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:28,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:29,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:36:30,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:31,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:31,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:32,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:33,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:34,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:35,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:36,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:37,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:37,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:38,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:39,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:40,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:41,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:41,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:42,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:36:44,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:36:45,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:36:45,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:36:46,668][__main__][INFO] - Iteration 997 took 55s (37.93% Gen, 62.07% Train). Generation: 20s, Training: 34s. Estimated remaining time: -264s. Estimated total time: 15h 21m 51s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 11s, 500 more iterations: 7h 40m 55s. [2025-08-20 23:36:46,669][__main__][INFO] - Starting iteration 997. 
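Once both sides' advantages are in hand, an advantage-alignment update weights the policy gradient by a combination of own and opponent advantages rather than by own advantages alone. The precise weighting used by mllm.training.trainer_ad_align is not visible in this log; the function below is one hypothetical form, where the opponent's step advantage is scaled by the running sum of the agent's own past advantages, so actions that helped both agents get reinforced.

```python
import torch

def aligned_advantages(own_adv: torch.Tensor,
                       opp_adv: torch.Tensor,
                       beta: float = 1.0) -> torch.Tensor:
    """Combine own and opponent advantages into one per-step weight.

    Hypothetical weighting in the spirit of advantage alignment, not the
    project's actual estimator. Shapes: (T,) for a single trajectory.
    """
    past_own = torch.cumsum(own_adv, dim=0) - own_adv   # strictly-past sum
    return own_adv + beta * past_own * opp_adv
```

The returned tensor would then take the place of `advantages` in the token-level policy-gradient accumulation sketched earlier.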
[2025-08-20 23:37:09,427][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:37:09,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:37:09,434][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:37:11,892][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:37:11,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:37:11,900][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:37:11,902][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:37:11,903][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:37:12,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:12,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:13,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:14,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:15,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:16,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:16,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:17,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:18,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:19,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:20,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:20,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:21,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:22,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:23,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:24,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:37:24,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:26,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:26,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:27,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:28,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:29,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:30,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:30,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:31,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:32,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:33,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:34,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:34,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:35,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:36,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:37,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:37:38,820][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:26, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:37:39,773][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:37:39,774][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:37:41,256][__main__][INFO] - Iteration 998 took 54s (37.18% Gen, 62.81% Train). Generation: 20s, Training: 34s. Estimated remaining time: -1044s. Estimated total time: 15h 9m 46s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 58s, 500 more iterations: 7h 34m 53s. [2025-08-20 23:37:41,257][__main__][INFO] - Starting iteration 998. 
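The "Create alternative trajectory batch" and "Compute alternative advantage estimates" tasks at the top of each iteration point at counterfactual scoring: recorded trajectories are re-assembled with alternative actions and pushed through the critic. Their contents are not logged; the helper below is a hypothetical construction of such a batch.

```python
import torch

def make_alternative_batch(input_ids: torch.Tensor,
                           action_mask: torch.Tensor,
                           alt_actions: torch.Tensor) -> torch.Tensor:
    """Splice alternative actions into recorded trajectories so the critic can
    score counterfactual behaviour (hypothetical construction; the log names
    the step but not its contents).

    input_ids, action_mask: (B, T); alt_actions: flat long tensor with one
    entry per masked position, same dtype as input_ids.
    """
    alt = input_ids.clone()
    alt[action_mask.bool()] = alt_actions.reshape(-1)  # overwrite action tokens
    return alt
```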
[2025-08-20 23:38:04,008][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:38:04,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:38:04,016][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:38:06,483][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:38:06,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:38:06,491][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%, [2025-08-20 23:38:06,493][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-08-20 23:38:06,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-08-20 23:38:06,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:07,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:08,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:09,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:09,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:10,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:11,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:12,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:13,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:13,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:14,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:15,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:16,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:17,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:17,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:18,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. 
[2025-08-20 23:38:19,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:20,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:21,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:21,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:23,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:23,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:24,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:25,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:26,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:27,143][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:27,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:28,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:29,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:30,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:31,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:31,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens. [2025-08-20 23:38:33,540][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%, [2025-08-20 23:38:34,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt [2025-08-20 23:38:34,488][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt [2025-08-20 23:38:35,845][__main__][INFO] - Iteration 999 took 54s (37.15% Gen, 62.85% Train). Generation: 20s, Training: 34s. Estimated remaining time: -1096s. Estimated total time: 15h 9m 47s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 58s, 500 more iterations: 7h 34m 53s. [2025-08-20 23:38:35,847][__main__][INFO] - Starting iteration 999. 
[2025-08-20 23:38:58,539][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:38:58,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:38:58,547][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:39:01,021][mllm.training.trainer_ad_align][INFO] - For task: Create alternative trajectory batch , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:39:01,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:39:01,028][mllm.training.trainer_ad_align][INFO] - For task: Compute alternative advantage estimates , ΔVRAM Allocated: 0.0 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:00, Percentage of VRAM taken: 69.91574385208075%,
[2025-08-20 23:39:01,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-08-20 23:39:01,031][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-08-20 23:39:01,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:02,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:02,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:03,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:04,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:05,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:06,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:06,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:07,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:08,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:09,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:10,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:10,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:11,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:12,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:13,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:14,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:14,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:15,620][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:16,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:17,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:18,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:18,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:19,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:20,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:21,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:21,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:23,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:24,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:24,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:25,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:26,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 8960 tokens.
[2025-08-20 23:39:28,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step , ΔVRAM Allocated: 2.288818359375e-05 GB, ΔVRAM Reserved: 0.0 GB, ΔTime: 00:00:27, Percentage of VRAM taken: 69.91579512747062%,
[2025-08-20 23:39:29,001][mllm.training.trainer_common][INFO] - Saved main optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/policy_optimizer_state.pt
[2025-08-20 23:39:29,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /network/scratch/m/mohammed.muqeeth/llm_negotiation/2025_08/ipd_prev_ad_align_qwen2.5_7b/seed_1000/agent_trainer/critic_optimizer_state.pt
[2025-08-20 23:39:30,310][__main__][INFO] - Iteration 1000 took 54s (37.14% Gen, 62.86% Train). Generation: 20s, Training: 34s. Estimated remaining time: -1276s. Estimated total time: 15h 7m 42s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 46s, 500 more iterations: 7h 33m 51s.
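Note: the last two summary lines report a negative "Estimated remaining time" (-1096 s at iteration 999, -1276 s at iteration 1000) while the forward projections ("10 more iterations: 9m 4s") stay sensible. That is consistent with a linear-extrapolation ETA computed against a fixed iteration budget that the run has overshot. A minimal sketch under that assumption; the budget value and this whole estimator are illustrative, since __main__'s code is not in the log:

```python
def eta_seconds(elapsed: float, iters_done: int, total_iters: int) -> float:
    """Linear-extrapolation ETA of the kind the summary lines suggest.

    Remaining time goes negative as soon as iters_done exceeds total_iters,
    which would explain -1276 s at iteration 1000 if the run was budgeted
    for roughly 976 iterations (an assumption, not a logged value).
    """
    avg = elapsed / max(iters_done, 1)  # mean seconds per finished iteration
    return avg * (total_iters - iters_done)

# Illustration: at ~54 s per iteration with a hypothetical budget of 976,
# 54 * (976 - 1000) = -1296 s, the same order as the -1276 s printed above.
# The "10 more iterations: 9m 4s" estimate stays positive because it
# extrapolates forward (10 * 54.4 s = 544 s) instead of against the budget.
```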