8-evals/data/pfoo_eval_summary_grader.yaml (252 lines of code) (raw):

prompts:
- 'Evaluate whether the items in summary_key_points are covered in the summary,
  both provided in <context> tags.
  Use a full point (1) for a strong match and a half point (0.5) for a partial match.
  If there are no points in summary_key_points, then assign a score of 0.0.
  <context>
  {{input}}
  </context>
  Think step by step and output the final ratio on a new line at the end of the output,
  formatted like "Fraction: 0.25", with no additional markdown or ``` characters.
  '
providers:
- id: openai:gpt-3.5-turbo
- id: openai:gpt-4o-mini
- id: openai:gpt-4o-2024-08-06
- id: openai:o1-mini
- id: openai:o1-preview
tests:
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''1.0'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      - Version 2.5 release scheduled for January 10th.
      - New user interface design finalized.
      - Security vulnerabilities from audit addressed.
      - Mobile app integration added to the roadmap.
      - Training sessions for the support team planned for December.
      </summary_key_points>
      <summary>
      In the latest software update meeting, the team confirmed that Version 2.5 is scheduled for release on January 10th.
      The new user interface design has been finalized, promising a better user experience.
      All security vulnerabilities identified during the recent audit have been addressed.
      Additionally, mobile app integration has been added to the project roadmap.
      Training sessions for the support team are planned for December to prepare for the upcoming changes.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''1.0'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      - Revenue increased by 10% compared to last quarter.
      - Marketing expenses exceeded budget by 5%.
      - New product launch scheduled for next month.
      - Customer satisfaction ratings improved by 15%.
      - Plan to expand into two new markets in Q3.
      </summary_key_points>
      <summary>
      In the meeting held on April 10th, the team reviewed the quarterly financial results.
      Revenue increased by 10% compared to the previous quarter, although marketing expenses exceeded the budget by 5%.
      The team was pleased to note that customer satisfaction ratings improved by 15%.
      They discussed the upcoming new product launch scheduled for next month and plans to expand into two new markets in the third quarter.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''1.0'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      - Mobile app beta testing extended by one week.
      - Backend API integration completed successfully.
      - Marketing campaign to begin on December 1st.
      - Next sprint planning meeting scheduled for November 20th.
      </summary_key_points>
      <summary>
      During the team meeting on November 15th, it was announced that the mobile app beta testing phase has been extended by one week to ensure quality.
      The backend API integration was completed successfully, which will enhance app performance.
      The marketing campaign is scheduled to commence on December 1st.
      The next sprint planning meeting is set for November 20th to discuss upcoming tasks and goals.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''0.5'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      - Software Build Version 2.1 failed quality assurance tests.
      - New feature X implementation postponed to next sprint.
      - Team member John is on sick leave for the rest of the week.
      - Client meeting scheduled for Friday rescheduled to next Monday.
      - Budget for Project Alpha increased by 10%.
      </summary_key_points>
      <summary>
      In today''s team meeting, we reviewed the current project statuses.
      The upcoming client meeting has been rescheduled to next Monday.
      Additionally, the budget for Project Alpha has been increased by 10%, which will allow us to allocate more resources.
      Team member John will be on sick leave for the rest of the week, and we wish him a speedy recovery.
      Overall, the team remains focused on delivering quality results.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''0.4'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      - The marketing campaign launch date has been moved from December 1st to December 5th.
      - The design team completed the new logo.
      - Budget allocation for Q1 has been increased by 10%.
      - The client requested additional features for the mobile app.
      - The next meeting is scheduled for next Monday at 10 AM.
      </summary_key_points>
      <summary>
      At the team meeting on November 10th, the design team announced the completion of the new company logo.
      The team expressed appreciation for the hard work put into the design.
      Additionally, it was agreed that the next meeting will be scheduled for next Monday at 10 AM to discuss ongoing projects and address any concerns.
      The team is encouraged to come prepared with updates on their respective tasks.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''0.4'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      - Backend API integration delayed due to authentication issues.
      - UI design changes need approval from the client.
      - Marketing campaign launch moved from December 1st to December 5th.
      - New team member joining next week.
      - Budget increased by 10% to accommodate additional resources.
      </summary_key_points>
      <summary>
      In today''s project meeting, we discussed the delay in integrating the backend API, which is facing authentication issues.
      The team is working on resolving these as a priority.
      Additionally, the marketing campaign launch has been rescheduled to December 5th to align with the updated development timeline.
      We are focusing on these critical tasks to ensure project success.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''0.0'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      </summary_key_points>
      <summary>
      In the team meeting on August 12th, the progress of the Beta project was discussed.
      The development team reported that the UI redesign is on track for completion by the end of the week.
      However, testing has uncovered several critical bugs that need immediate attention.
      The team agreed to prioritize bug fixes over new feature development for the next sprint.
      The release date remains unchanged, but the situation requires close monitoring.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''0.0'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      </summary_key_points>
      <summary>
      During the team meeting on September 14th, several issues were discussed regarding the Beta project.
      The development team reported progress on the new features but identified several bugs that need to be resolved before the next release.
      The testing team is working diligently to ensure all features are thoroughly tested.
      The project manager emphasized the importance of meeting the upcoming deadline and encouraged collaboration between teams.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''0.0'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      </summary_key_points>
      <summary>
      During the project update meeting on December 1st, the team reviewed the progress of the Delta rollout.
      The API development has been completed ahead of schedule, allowing for early integration testing.
      Marketing materials are in the final stages of approval and will be released next week.
      The team also discussed potential risks associated with third-party dependencies but agreed on mitigation strategies.
      Overall, the project is on track for the planned release date of December 15th.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''0.875'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      - The client requested additional features for the mobile app.
      - The backend team reported delays due to unexpected bugs.
      - The marketing campaign is set to launch next Monday.
      - Budget constraints may affect the Q2 development plans.
      </summary_key_points>
      <summary>
      In today''s team meeting, we discussed recent client feedback and project challenges.
      Concerns about our delivery schedule have arisen due to technical issues encountered.
      Additionally, preparations for upcoming promotional activities are progressing, targeting an early next week launch.
      Financial limitations were also highlighted as influencing our future planning.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''0.6'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      - CEO plans to retire at the end of the year.
      - New product line launch scheduled for Q2 next year.
      - Marketing budget increased by 20% for next quarter.
      - Remote work policy extended indefinitely.
      - Company facing a lawsuit over patent infringement.
      </summary_key_points>
      <summary>
      In the recent board meeting, several significant announcements were made impacting the company''s future.
      Leadership changes are on the horizon as top executives consider their long-term roles.
      The introduction of innovative offerings is planned for mid-next year, aligning with the company''s growth strategy.
      Investment in promotional activities will see a noticeable increase, reflecting the importance of market presence.
      The organizational policies continue to adapt to the evolving work environment.
      Legal challenges present new obstacles that the company is preparing to address proactively.
      </summary>'
- assert:
  - metric: frac_key_points_in_summary
    type: python
    value: '''0.5'' in output.split(''\n'')[-1]'
  description: Test if summary is correct
  vars:
    input: '<summary_key_points>
      - Quarterly sales increased by 5% compared to last quarter.
      - Marketing campaign extended by two weeks due to positive feedback.
      - Budget for R&D increased by $500,000.
      - New office in San Francisco delayed due to permit issues.
      </summary_key_points>
      <summary>
      During the quarterly meeting, the team discussed recent performance metrics.
      Sales have improved compared to the previous quarter, indicating a positive trend.
      Given the favorable response, the marketing efforts will be prolonged to capitalize on this momentum.
      There have been adjustments in the budget allocations, particularly enhancing support for research and development initiatives.
      Meanwhile, the planned opening of the new office location is facing delays owing to certain logistical challenges.
      </summary>'
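
The assertions in this config run a substring test on the last line of the grader's output. The sketch below mirrors that expression in plain Python and shows two cases where it may behave unexpectedly: a trailing newline, and a longer decimal that contains the expected value. The helpers `parse_fraction` and `expected_fraction` are hypothetical illustrations, not part of the original config, and the scoring arithmetic is only a reading of the prompt text (1 per strong match, 0.5 per partial match, 0.0 when there are no key points).

```python
import re
from typing import Optional


def passes_original_check(output: str, expected: str) -> bool:
    """Mirror the YAML assertion: substring test on the last '\\n'-separated line."""
    return expected in output.split("\n")[-1]


# Hypothetical pattern for the 'Fraction: 0.25' format requested by the prompt.
_FRACTION_RE = re.compile(r"Fraction:\s*([0-9]*\.?[0-9]+)")


def parse_fraction(output: str) -> Optional[float]:
    """Return the number on the last non-empty line, or None if it is absent."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return None
    match = _FRACTION_RE.search(lines[-1])
    return float(match.group(1)) if match else None


def expected_fraction(full: int, partial: int, total: int) -> float:
    """Assumed scoring rule: (full + 0.5 * partial) / total, or 0.0 with no key points."""
    if total == 0:
        return 0.0
    return (full + 0.5 * partial) / total


if __name__ == "__main__":
    clean = "Step-by-step reasoning...\nFraction: 1.0"
    trailing = clean + "\n"
    print(passes_original_check(clean, "1.0"))     # last line holds the fraction
    print(passes_original_check(trailing, "1.0"))  # last line is empty here
    print(parse_fraction(trailing))                # ignores the trailing blank line
    print(expected_fraction(3, 1, 4))              # three strong and one partial match of four
```

If a stricter check is wanted, an assertion could compare the parsed number against the expected value within a small tolerance rather than relying on a substring match; whether that fits the intended metric is a judgment call for the config's author.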