Semantic Evaluation

Validate dynamic outputs against an expected output semantically using our AI-based service, LLM Evaluator.

Usage 🚀

Refer to the Setup Guide for installing dependencies for both Java and Python.

JAVA

There are two clients available with the Java SDK: SyncClient and AsyncClient.

Java Client Code

// for the async client
import org.qyrus.ai_sdk.Clients.AsyncClient;
AsyncClient client = new AsyncClient(<API_TOKEN>, null);

// for the sync client
import org.qyrus.ai_sdk.Clients.SyncClient;
SyncClient client = new SyncClient(<API_TOKEN>, null);

Using SyncClient

Here's an example of using LLM Evaluator with SyncClient.

Java SyncClient Code
import io.github.cdimascio.dotenv.Dotenv;
import java.util.ArrayList;
import java.util.List;
import org.qyrus.ai_sdk.Clients.SyncClient;
// plus the LLMEval response type from your SDK package

private static void testLLMEvaluator() {

    Dotenv dotenv = Dotenv.load();
    String QYRUS_AI_SDK_API_TOKEN = dotenv.get("QYRUS_AI_SDK_API_TOKEN");

    SyncClient client = new SyncClient(QYRUS_AI_SDK_API_TOKEN, null);
    String context = "application is about generating dynamic text for messages on phone";
    String expected_output = "Winning lottery of 10k$";
    List<String> executed_output = new ArrayList<>();
    executed_output.add("You have won 10000 dollars");
    String guardrails = "No sensitive info";

    long startTime = System.currentTimeMillis();
    int numberOfRequests = 1;

    for (int i = 0; i < numberOfRequests; i++) {
        try {
            // Evaluate the executed output against the expected output,
            // given the application context and guardrails.
            LLMEval.LLMEvalResponse response = client.llmevaluator.evaluate(context, expected_output, executed_output, guardrails);
            // The response exposes the evaluation report via getReport().
            String report = response.getReport();
            System.out.println("Report: " + report);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    long endTime = System.currentTimeMillis();
    System.out.println("Synchronous Total time for LLM Eval request: " + (endTime - startTime) + " ms");
}

Using AsyncClient

Here's an example of using LLM Evaluator with AsyncClient.

Java AsyncClient Code
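
The following is a minimal sketch, assuming AsyncClient exposes the same llmevaluator.evaluate call as SyncClient and returns a CompletableFuture of the same response type; verify the exact return type against your SDK version.

import io.github.cdimascio.dotenv.Dotenv;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.qyrus.ai_sdk.Clients.AsyncClient;

private static void testLLMEvaluatorAsync() {

    Dotenv dotenv = Dotenv.load();
    String QYRUS_AI_SDK_API_TOKEN = dotenv.get("QYRUS_AI_SDK_API_TOKEN");

    AsyncClient client = new AsyncClient(QYRUS_AI_SDK_API_TOKEN, null);
    String context = "application is about generating dynamic text for messages on phone";
    String expected_output = "Winning lottery of 10k$";
    List<String> executed_output = new ArrayList<>();
    executed_output.add("You have won 10000 dollars");
    String guardrails = "No sensitive info";

    // Assumed: the async variant returns a CompletableFuture of the same
    // LLMEval.LLMEvalResponse type used by SyncClient.
    CompletableFuture<LLMEval.LLMEvalResponse> future =
            client.llmevaluator.evaluate(context, expected_output, executed_output, guardrails);

    future.thenAccept(response -> System.out.println("Report: " + response.getReport()))
          .exceptionally(e -> { e.printStackTrace(); return null; })
          .join(); // block so the example completes before the JVM exits
}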

PYTHON

There are two clients available with the Python SDK: SyncQyrusAI and AsyncQyrusAI.

Python Client Code
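
A minimal sketch of client construction; the import path and the api_token parameter are assumptions mirroring the Java clients, so check the Setup Guide for the exact names.

# A minimal sketch; the module path and the api_token parameter are
# assumptions mirroring the Java clients.
from qyrus_ai import AsyncQyrusAI, SyncQyrusAI  # hypothetical import path

# for the async client
async_client = AsyncQyrusAI(api_token="<API_TOKEN>")

# for the sync client
sync_client = SyncQyrusAI(api_token="<API_TOKEN>")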

Using SyncQyrusAI

Here's an example of using LLM Evaluator with SyncQyrusAI.

Python SyncQyrusAI Code
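
The sketch below mirrors the Java SyncClient example; the evaluate() method and response.report attribute are assumed names based on the Java SDK surface.

# Mirrors the Java SyncClient example; evaluate() and response.report are
# assumed names based on the Java SDK surface.
import os
from qyrus_ai import SyncQyrusAI  # hypothetical import path

client = SyncQyrusAI(api_token=os.environ["QYRUS_AI_SDK_API_TOKEN"])

context = "application is about generating dynamic text for messages on phone"
expected_output = "Winning lottery of 10k$"
executed_output = ["You have won 10000 dollars"]
guardrails = "No sensitive info"

# Assumed to take the same four arguments as the Java SDK's evaluate().
response = client.llm_evaluator.evaluate(context, expected_output, executed_output, guardrails)
print("Report:", response.report)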

Using AsyncQyrusAI

Here's an example of using LLM Evaluator with AsyncQyrusAI.

Python AsyncQyrusAI Code
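
A minimal sketch mirroring the sync example; it assumes evaluate() is awaitable on the async client with the same parameters.

# Mirrors the sync example; assumes evaluate() is awaitable on the async client.
import asyncio
import os
from qyrus_ai import AsyncQyrusAI  # hypothetical import path

async def main():
    client = AsyncQyrusAI(api_token=os.environ["QYRUS_AI_SDK_API_TOKEN"])
    response = await client.llm_evaluator.evaluate(
        "application is about generating dynamic text for messages on phone",  # context
        "Winning lottery of 10k$",                                             # expected output
        ["You have won 10000 dollars"],                                        # executed output
        "No sensitive info",                                                   # guardrails
    )
    print("Report:", response.report)  # assumed attribute name

asyncio.run(main())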

Python-only: RAG and MCP Testing

The Python SDK includes additional LLM Evaluator capabilities for RAG (Retrieval-Augmented Generation) and MCP (Model Context Protocol / tool-calling) testing. These helpers are available via the llm_evaluator.evaluator object on both AsyncQyrusAI and SyncQyrusAI.

Initialize LLM Evaluator
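
A minimal sketch; per the note above, the RAG/MCP helpers hang off the llm_evaluator.evaluator object. The import path is an assumption.

# The evaluator helpers are reached via llm_evaluator.evaluator (documented
# above); the import path is an assumption.
import os
from qyrus_ai import AsyncQyrusAI  # hypothetical import path

client = AsyncQyrusAI(api_token=os.environ["QYRUS_AI_SDK_API_TOKEN"])
evaluator = client.llm_evaluator.evaluator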

Evaluate RAG (Retrieval-Augmented Generation) Systems
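
A hypothetical sketch of a RAG evaluation call, continuing from the evaluator initialized above (inside an async function); evaluate_rag and the RAGTestCase model are illustrative names, not confirmed API, though the SDK does accept Pydantic inputs.

# Hypothetical: evaluate_rag() and RAGTestCase are illustrative names only.
from pydantic import BaseModel

class RAGTestCase(BaseModel):  # illustrative stand-in for the SDK's model
    query: str
    retrieved_contexts: list[str]
    generated_answer: str
    expected_answer: str

case = RAGTestCase(
    query="What is the refund window?",
    retrieved_contexts=["Refunds are accepted within 30 days of purchase."],
    generated_answer="You can get a refund within 30 days.",
    expected_answer="Refunds are allowed within 30 days.",
)
result = await evaluator.evaluate_rag(case)  # hypothetical method name
print(result)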

Evaluate MCP (Model Context Protocol) Tool-Calling Systems
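
A hypothetical sketch of checking a tool-calling decision, again continuing from the evaluator above; evaluate_mcp and MCPTestCase are illustrative names, not confirmed API.

# Hypothetical: evaluate_mcp() and MCPTestCase are illustrative names only.
from pydantic import BaseModel

class MCPTestCase(BaseModel):  # illustrative stand-in for the SDK's model
    user_request: str
    expected_tool: str
    actual_tool_call: dict

case = MCPTestCase(
    user_request="Book a meeting for 3pm tomorrow",
    expected_tool="create_calendar_event",
    actual_tool_call={"name": "create_calendar_event",
                      "arguments": {"date": "tomorrow", "time": "15:00"}},
)
result = await evaluator.evaluate_mcp(case)  # hypothetical method name
print(result)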

Batch Evaluation
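
A hypothetical sketch of running several test cases in one call; evaluate_batch is an illustrative name only.

# Hypothetical: evaluate_batch() is an illustrative name for submitting
# several test cases at once.
cases = [rag_case_one, rag_case_two]  # test cases built as shown above
results = await evaluator.evaluate_batch(cases)  # hypothetical method name
for result in results:
    print(result)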

Using JSON Input (Alternative to Pydantic)
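
The heading above indicates JSON input is accepted as an alternative to Pydantic models; the sketch below passes a plain dict with the same fields, though the exact accepted shape is an assumption.

# Hypothetical: a plain dict with the same fields as the Pydantic model;
# the exact accepted shape may differ.
case_json = {
    "query": "What is the refund window?",
    "retrieved_contexts": ["Refunds are accepted within 30 days of purchase."],
    "generated_answer": "You can get a refund within 30 days.",
    "expected_answer": "Refunds are allowed within 30 days.",
}
result = await evaluator.evaluate_rag(case_json)  # hypothetical method name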

Legacy Judge Evaluation (Backwards Compatibility)
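
The legacy judge-based evaluation (the flow shown in the Java examples above) remains available; a sketch assuming it takes the same four arguments as the Java evaluate() call.

# Assumed to mirror the Java evaluate() signature: context, expected output,
# executed outputs, guardrails.
response = await client.llm_evaluator.evaluate(
    "application is about generating dynamic text for messages on phone",
    "Winning lottery of 10k$",
    ["You have won 10000 dollars"],
    "No sensitive info",
)
print("Report:", response.report)  # assumed attribute name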

Synchronous Usage
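
A hypothetical sketch of the same helpers on SyncQyrusAI, without await.

# Hypothetical: the sync client exposes the same evaluator surface, minus await.
import os
from qyrus_ai import SyncQyrusAI  # hypothetical import path

client = SyncQyrusAI(api_token=os.environ["QYRUS_AI_SDK_API_TOKEN"])
evaluator = client.llm_evaluator.evaluator
result = evaluator.evaluate_rag(case)  # hypothetical method name; case built as above
print(result)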

Advanced MCP with Schema Validation
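
A hypothetical sketch of supplying a tool schema so tool-call arguments can be validated alongside the tool choice; the tool_schema parameter is an illustrative name only.

# Hypothetical: tool_schema is an illustrative parameter name for validating
# tool-call arguments against a JSON Schema.
tool_schema = {
    "name": "create_calendar_event",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string"},
            "time": {"type": "string"},
        },
        "required": ["date", "time"],
    },
}
result = await evaluator.evaluate_mcp(case, tool_schema=tool_schema)  # hypothetical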

Note: The legacy LLM Evaluator (the original judge-based evaluation) is accessible via REST APIs. The new RAG and MCP testing capabilities are currently available only in the Python SDK and are not yet exposed via the REST API; they will be made available via REST APIs soon.
