lm_eval/tasks/leaderboard/gpqa/_leaderboard_gpqa.yaml (9 lines of code) (raw):

group: leaderboard_gpqa task: - leaderboard_gpqa_diamond - leaderboard_gpqa_extended - leaderboard_gpqa_main aggregate_metric_list: - metric: acc_norm aggregation: mean weight_by_size: true