lmms_eval/tasks/mme/mme.yaml

dataset_path: lmms-lab/MME
dataset_kwargs:
  token: True
task: "mme"
test_split: test
output_type: generate_until
doc_to_visual: !function utils.mme_doc_to_visual
doc_to_text: !function utils.mme_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 16
  temperature: 0
  top_p: 0
  num_beams: 1
  do_sample: false
# The return value of process_results will be used by metrics
process_results: !function utils.mme_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
  - metric: mme_percetion_score
    aggregation: !function utils.mme_aggregate_results
    higher_is_better: true
  - metric: mme_cognition_score
    aggregation: !function utils.mme_aggregate_results
    higher_is_better: true
model_specific_prompt_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\nAnswer the question using a single word or phrase."
  qwen_vl:
    pre_prompt: ""
    post_prompt: " Answer:"
  otterhd:
    pre_prompt: ""
    post_prompt: " Answer:"
metadata:
  - version: 0.0
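
The two `!function` hooks tie this config to Python callables in `utils.py`: `process_results` must return a dict whose keys match the `metric` names in `metric_list`, and each aggregation function then receives the collected records for its key. The sketch below is only an illustration of that contract under assumed signatures and a simplified exact-match scorer; the category split and scoring details are assumptions, and the repo's actual `utils.py` may differ.

# Illustrative sketch, not the repo's actual lmms_eval/tasks/mme/utils.py.
from collections import defaultdict

# Assumed split of MME categories into the cognition group; everything else
# is treated as perception in this sketch.
COGNITION_CATEGORIES = {"commonsense_reasoning", "numerical_calculation",
                        "text_translation", "code_reasoning"}


def mme_process_results(doc, results):
    """Turn one model generation into a per-document record.

    The returned keys must match the `metric` names in metric_list, since the
    harness groups records by key before calling the aggregation function.
    """
    prediction = results[0].strip().lower()
    gold = doc["answer"].strip().lower()
    score = 1.0 if prediction.startswith(gold) else 0.0
    key = ("mme_cognition_score" if doc["category"] in COGNITION_CATEGORIES
           else "mme_percetion_score")  # key spelling mirrors the config
    return {key: {"category": doc["category"], "score": score}}


def mme_aggregate_results(results):
    """Collapse the per-document records for one metric into a single score."""
    per_category = defaultdict(list)
    for item in results:
        per_category[item["category"]].append(item["score"])
    # Sum of per-category accuracies, each scaled to 0-100.
    return sum(100.0 * sum(scores) / len(scores) for scores in per_category.values())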