LLM Evaluator

Overview
The LLM Evaluator is a service that evaluates the responses generated by a Large Language Model (LLM). The service is designed to compare the executed output of the application against the expected output to ensure accuracy and adherence to predefined guidelines. This document outlines the steps involved in utilizing the LLM Evaluator.
Prerequisites
Access to the Qyrus platform.
A registered account with the appropriate permissions.
Step-by-Step Guide
Select LLM Evaluator
a. Click LLM Evaluator to get started with the application.

b. After clicking, you land on the application page, which contains four text boxes: 'About the Application', 'Expected Output', 'Executed Output(s)', and 'Exceptions or Inclusions'.

About the Application
Describe the application: Provide a brief description of what the application is about.
Example: "This application is a chatbot that can answer questions about the company's products and services."

Expected Output
Describe the expected output of the application: Mention the format and specifics of the expected response.
Example: "The expected output is as follows: Your account balance is $xx.yy" for a bank-related application.

Executed Output(s)
List the executed outputs: Provide examples of the outputs generated by the application. Assuming a bank-related application:
Example: "Your account balance is $44"
Example: "You have 6789 dollars in your account"
Example: "You have $999999.99 as balance"
Example: "You have $1000005 as balance"

Exceptions or Inclusions
Mention any exceptions or inclusions: Specify any conditions or rules that must be followed.
Example: "The bot should not output if the amount is more than 1 million dollars."

Evaluate
a. Press the Evaluate button to evaluate the Executed Output(s):

b. The evaluation is returned as the output: a detailed assessment of the executed outputs compared against the expected output, broken down into the following fields.
Relevance: Rate how relevant the response is to the expected output.
Score: Assign a score based on the relevance and other factors.
Positives: List positive aspects of the response.
Reasoning: Provide detailed reasoning for the evaluation.
Negatives: Note any negative aspects or issues.
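The result fields above can be pictured as a simple record. The class below is a hypothetical sketch that mirrors the fields shown on the results page; it is not an actual Qyrus data structure or API response:

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationResult:
    """Hypothetical container mirroring the evaluation fields on the results page."""
    relevance: str                  # e.g. "Pass" or "Fail"
    score: int                      # numeric score, e.g. on a 0-5 scale
    positives: list                 # positive aspects of the response
    reasoning: str                  # detailed reasoning for the evaluation
    negatives: list = field(default_factory=list)  # issues found, if any

# Example mirroring the bank-balance evaluation shown later in this document.
result = EvaluationResult(
    relevance="Pass",
    score=5,
    positives=["Semantically Relevant", "Adheres to Guardrails"],
    reasoning="The executed output provides the account balance, which is the core element expected.",
)
```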
Demo Video
This is a Demo Video of LLM Evaluator
Use Cases
Chatbot Responses: Evaluate the responses generated by a chatbot to ensure they are accurate and relevant.
Automated Customer Support: Verify that the automated responses in customer support applications adhere to guidelines and provide correct information.
Content Generation: Assess the quality and relevance of the content generated by an LLM for various applications.
Examples
Example 1: Bank Related Data Query
About the Application: "This application is a chatbot which helps the user to get their bank related data."
Expected Output: "The expected output is as follows: Your account balance is $xx.yy"
Executed Output(s): "Your account balance is $44"
Exceptions or Inclusions: "The bot should not output if the amount is more than 1 million dollars."
Evaluation Result: Executed Output: "Your account balance is $44"
Relevance: Pass
Score: 5
Positives:
Semantically Relevant
Adheres to Guardrails
Contains Critical Details
Clear and Understandable
Reasoning: The executed output is semantically relevant as it provides the account balance, which is the core element expected. It adheres to the guardrails as the amount is not more than 1 million dollars. The response contains the critical detail of the account balance and is clear and understandable. Although the format differs slightly (missing cents), it does not affect the overall relevance and completeness of the response.
Negatives: N/A
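The reasoning above notes that "Your account balance is $44" matches the expected output in meaning but not strictly in format (the cents are missing). A strict, literal format check, sketched below for illustration only (this is not how the evaluator works internally), would flag exactly the difference that the semantic evaluation tolerates:

```python
import re

# Strict pattern for "Your account balance is $xx.yy" (dollars and two-digit cents).
STRICT = re.compile(r"^Your account balance is \$\d+\.\d{2}$")

print(bool(STRICT.match("Your account balance is $44.00")))  # True: exact format
print(bool(STRICT.match("Your account balance is $44")))     # False: cents missing
```

This contrast is why the example still scores 5: the LLM-based evaluation judges semantic relevance and guardrail adherence rather than exact string format.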
FAQs
Q. What is the LLM Evaluator?
The LLM Evaluator is a service designed to evaluate the responses generated by a Large Language Model (LLM). It compares the executed output of an application against the expected output to ensure accuracy, relevance, and adherence to predefined guidelines.
Q. How does the LLM Evaluator work?
The LLM Evaluator works by analyzing the executed outputs of an application and comparing them with the expected outputs. It evaluates the relevance, completeness, and adherence to any specified exceptions or inclusions. The service provides a detailed evaluation, including a score, reasoning, and any positives or negatives identified in the response.
Q. What are the common use cases for the LLM Evaluator?
The LLM Evaluator is commonly used for:
Chatbot Responses: Ensuring chatbot responses are accurate and relevant.
Automated Customer Support: Verifying that automated responses in customer support applications adhere to guidelines and provide correct information.
Content Generation: Assessing the quality and relevance of content generated by an LLM for various applications.