LLM Evaluator

Overview
The LLM Evaluator is a service that evaluates the responses generated by a Large Language Model (LLM). The service is designed to compare the executed output of the application against the expected output to ensure accuracy and adherence to predefined guidelines. This document outlines the steps involved in utilizing the LLM Evaluator.
Prerequisites
Access to the Qyrus platform.
A registered account with the appropriate permissions.
Step-by-Step Guide
Select LLM Evaluator
a. Click LLM Evaluator to get started with the application.

b. After clicking, you land on the application page, which contains four text boxes: 'About the Application', 'Expected Output', 'Executed Output(s)', and 'Exceptions or Inclusions'.

About the Application
Describe the application: Provide a brief description of what the application is about.
Example: "This application is a chatbot that can answer questions about the company's products and services."

Expected Output
Describe the expected output of the application: Mention the format and specifics of the expected response.
Example: "The expected output is as follows: Your account balance is $xx.yy" for a bank-related application.

Executed Output(s)
List the executed outputs: Provide examples of the outputs generated by the application. Assuming a bank-related application:
Example: "Your account balance is $44"
Example: "You have 6789 dollars in your account"
Example: "You have $999999.99 as balance"
Example: "You have $1000005 as balance"

Exceptions or Inclusions
Mention any exceptions or inclusions: Specify any conditions or rules that must be followed.
Example: "The bot should not output if the amount is more than 1 million dollars."

Evaluate
a. Press the Evaluate button to evaluate the Executed Output(s):

b. The evaluation is returned as the output: a detailed assessment of the executed outputs compared against the expected output, broken down into the following fields.
Relevance: Rate how relevant the response is to the expected output.
Score: Assign a score based on the relevance and other factors.
Positives: List positive aspects of the response.
Reasoning: Provide detailed reasoning for the evaluation.
Negatives: Note any negative aspects or issues.
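The result fields above can be pictured as a simple record. The class below is a hypothetical sketch that mirrors the fields shown on the results page; it is not an actual Qyrus data structure or API response:

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationResult:
    """Hypothetical container mirroring the evaluation fields on the results page."""
    relevance: str                  # e.g. "Pass" or "Fail"
    score: int                      # numeric score, e.g. on a 0-5 scale
    positives: list                 # positive aspects of the response
    reasoning: str                  # detailed reasoning for the evaluation
    negatives: list = field(default_factory=list)  # issues found, if any

# Example mirroring the bank-balance evaluation shown later in this document.
result = EvaluationResult(
    relevance="Pass",
    score=5,
    positives=["Semantically Relevant", "Adheres to Guardrails"],
    reasoning="The executed output provides the account balance, which is the core element expected.",
)
```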
Demo Video
This is a Demo Video of LLM Evaluator
Use Cases
Chatbot Responses: Evaluate the responses generated by a chatbot to ensure they are accurate and relevant.
Automated Customer Support: Verify that the automated responses in customer support applications adhere to guidelines and provide correct information.
Content Generation: Assess the quality and relevance of the content generated by an LLM for various applications.
Examples
Example 1: Bank Related Data Query
About the Application: "This application is a chatbot which helps the user to get their bank related data."
Expected Output: "The expected output is as follows: Your account balance is $xx.yy"
Executed Output(s): "Your account balance is $44"
Exceptions or Inclusions: "The bot should not output if the amount is more than 1 million dollars."
Evaluation Result: Executed Output: "Your account balance is $44"
Relevance: Pass
Score: 5
Positives:
Semantically Relevant
Adheres to Guardrails
Contains Critical Details
Clear and Understandable
Reasoning: The executed output is semantically relevant as it provides the account balance, which is the core element expected. It adheres to the guardrails as the amount is not more than 1 million dollars. The response contains the critical detail of the account balance and is clear and understandable. Although the format differs slightly (missing cents), it does not affect the overall relevance and completeness of the response.
Negatives: N/A
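The reasoning above notes that "Your account balance is $44" matches the expected output in meaning but not strictly in format (the cents are missing). A strict, literal format check, sketched below for illustration only (this is not how the evaluator works internally), would flag exactly the difference that the semantic evaluation tolerates:

```python
import re

# Strict pattern for "Your account balance is $xx.yy" (dollars and two-digit cents).
STRICT = re.compile(r"^Your account balance is \$\d+\.\d{2}$")

print(bool(STRICT.match("Your account balance is $44.00")))  # True: exact format
print(bool(STRICT.match("Your account balance is $44")))     # False: cents missing
```

This contrast is why the example still scores 5: the LLM-based evaluation judges semantic relevance and guardrail adherence rather than exact string format.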
FAQs
Q. What is the LLM Evaluator?
The LLM Evaluator is a service designed to evaluate the responses generated by a Large Language Model (LLM). It compares the executed output of an application against the expected output to ensure accuracy, relevance, and adherence to predefined guidelines.
Q. How does the LLM Evaluator work?
The LLM Evaluator works by analyzing the executed outputs of an application and comparing them with the expected outputs. It evaluates the relevance, completeness, and adherence to any specified exceptions or inclusions. The service provides a detailed evaluation, including a score, reasoning, and any positives or negatives identified in the response.
Q. What are the common use cases for the LLM Evaluator?
The LLM Evaluator is commonly used for:
Chatbot Responses: Ensuring chatbot responses are accurate and relevant.
Automated Customer Support: Verifying that automated responses in customer support applications adhere to guidelines and provide correct information.
Content Generation: Assessing the quality and relevance of content generated by an LLM for various applications.