# recipes/Qwen2.5-Coder-7B-Instruct/grpo/config_codeforces.yaml
# Model arguments
model_name_or_path: Qwen/Qwen2.5-Coder-7B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
# Data training arguments
dataset_name: open-r1/codeforces
dataset_prompt_column: prompt
dataset_config: verifiable-prompts
dataset_test_split: test
dataset_train_split: train
system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"
# GRPO trainer config
callbacks:
- push_to_hub_revision
benchmarks:
- lcb_v4
beta: 0.0
loss_type: dr_grpo
scale_rewards: false
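# Note: the three settings above disable the KL penalty (beta: 0.0) and follow a Dr. GRPO-style
# objective (loss_type: dr_grpo with scale_rewards: false, i.e. no per-group std scaling).
# This reading is based on our understanding of TRL's GRPOConfig options, not stated in this file.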
bf16: true
do_eval: false
eval_strategy: "no"
use_vllm: true
vllm_device: auto
vllm_gpu_memory_utilization: 0.7
gradient_accumulation_steps: 32
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
hub_model_id: open-r1/Qwen2.5-Coder-7B-Instruct-Codeforces-GRPO
hub_model_revision: v01.00
hub_strategy: every_save
learning_rate: 1.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: constant_with_warmup
max_grad_norm: 0.2
max_prompt_length: 2000
max_completion_length: 8192
max_steps: -1
num_generations: 16
# aiming for 1k optimization steps
# total_samples_per_batch = num_gpus * grad_accumulation_steps * per_device_batch_size = 8 * 32 * 4 = 1024
# unique_prompts_per_batch = total_samples_per_batch / num_generations = 1024 / 16 = 64
# dataset size ~= 16k prompts (8k problems * 2, for Python and C++)
# global_steps_per_epoch = dataset_size / unique_prompts_per_batch = 16k / 64 ~= 250
# epochs_for_1k_steps = 1000/250 = 4 epochs
num_train_epochs: 4
output_dir: data/Qwen2.5-Coder-7B-Instruct-Codeforces-GRPO_v01.00
overwrite_output_dir: true
per_device_train_batch_size: 4
push_to_hub: true
report_to:
- wandb
reward_funcs:
- cf_code
- code_format
reward_weights:
- 1.0
- 0.1
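# reward_weights are applied positionally to reward_funcs above, so (assuming TRL's usual
# weighted-sum combination) the total reward per completion is roughly:
#   reward = 1.0 * cf_code + 0.1 * code_format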
save_strategy: "steps"
save_steps: 0.05
save_total_limit: 1
seed: 42
temperature: 0.7
wandb_entity: huggingface
wandb_project: open-r1
warmup_ratio: 0.1
mask_truncated_completions: true
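# mask_truncated_completions should exclude completions cut off at max_completion_length from
# the loss (our reading of the TRL option; the config itself does not document it).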
# For each generation, evaluate this many test cases in parallel, then check whether any of
# them failed (0 score): if so, stop evaluating; otherwise continue with the next batch of
# test cases. Useful to avoid overloading the eval server and to save time on wrong solutions.
code_eval_test_batch_size: -1
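# A minimal sketch (kept in comments so the YAML stays valid) of the early-stopping batched
# evaluation described above. It is only illustrative: run_test() and the batch handling are
# hypothetical, not the actual open-r1 reward code, and -1 is assumed here to mean "all tests
# in a single batch".
#
#   def passes_tests(candidate, tests, batch_size=-1):
#       # batch_size == -1 is treated as "evaluate all tests in one batch"
#       step = batch_size if batch_size > 0 else max(len(tests), 1)
#       for i in range(0, len(tests), step):
#           batch = tests[i:i + step]
#           # in practice each batch would be run in parallel on the eval server
#           results = [run_test(candidate, t) for t in batch]
#           if not all(results):
#               return False  # any failed test means a 0 score, so stop evaluating early
#       return True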
code_eval_scoring_mode: weighted_sum