Skip to content

Make AI work for Everyone - Monitoring and governing for your AI/ML

License

Notifications You must be signed in to change notification settings

arthur-ai/arthur-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

28cf07c · May 14, 2025
May 2, 2025
May 13, 2025
Apr 14, 2025
May 14, 2025
Apr 22, 2025
Apr 19, 2025
Apr 22, 2025
Apr 23, 2025
Mar 26, 2025
May 8, 2025
May 8, 2025

Repository files navigation

Arthur AI Logo

Make AI work for Everyone.

GenAI Engine CI Discord

Website - Documentation - Talk to someone at Arthur

The Arthur Engine

The Arthur Engine is a tool designed for:

  • Evaluating and Benchmarking Machine Learning models
    • Support for a wide range of evaluation metrics (e.g., drift, accuracy, precision, recall, F1, and AUC)
    • Tools for comparing models, exploring feature importance, and identifying areas for optimization
    • For LLMs/GenAI applications, measure and monitor response relevance, hallucination rates, token counts, latency, and more
  • Enforcing guardrails in your LLM Applications and Generative AI Workflows
    • Configurable metrics for real-time detection of PII or Sensitive Data leakage, Hallucination, Prompt Injection attempts, Toxic language, and other quality metrics
  • Extensibility to fit into your application's architecture
    • Support for plug-and-play metrics and extensible API so you can bring your own custom-models or popular open-source models (inc. HuggingFace, etc.)

Quickstart - See Examples

  1. Clone the repository and cd deployment/docker-compose/genai-engine
  2. Create .env file from .env.template file and modify it (more instructions can be found in README on the current path)
  3. Run docker compose up
  4. Wait for the genai-engine container to initialize then navigate to localhost:3030/docs to see the API docs
  5. Start building!.

Arthur Platform Enterprise Version

The Arthur Engine provides a complete service for monitoring and governing your AI/ML workloads using popular Open-Source technologies and frameworks (see below for further details).

The enterprise version of the Arthur Platform provides better performance, additional features, and capabilities, including custom enterprise-ready guardrails + metrics, which can maximize the potential of AI for your organization.

Key features:

  • State-of-the-art proprietary evaluation models trained by Arthur's world-class machine learning engineering team
  • Airgapped deployment of the Arthur Engine (no dependency to Hugging Face Hub)
  • Optional on-premises deployment of the entire Arthur Platform
  • Support from the world-class engineering teams at Arthur

To learn more about the enterprise version of the Arthur Platform, reach out!

Performance Comparison between Free vs Enterprise version of Arthur Engine :

Enterprise version of Arthur Engine leverages state-of-the-art high-performing, low latency proprietary models for some of the LLM evaluations. Please see below for a detailed comparison between open-source vs enterprise performance.

Evaluation Type Dataset Free Version Performance (f1) Enterprise Performance (f1) Free Version Average Latency per Inference (s) Enterprise Average Latency per Inference (s)
Prompt Injection deepset 0.52 (0.44, 0.60) 0.89 (0.85, 0.93) 0.966 0.03
Prompt Injection Arthur’s Custom Benchmark 0.79 (0.62, 0.93) 0.85 (0.71, 0.96) 0.16 0.005
Toxicity Arthur’s Custom Benchmark 0.633 (0.45, 0.79) 0.89 (0.85, 0.93) 3.096 0.0358

A free SaaS version of Arthur Platform is coming soon!

If you are interested in joining the waitlist or learning more, send us a note

Arthur GenAI Evals

Overview

The Arthur Engine is built with a focus on transparency and explainability, this framework provides users with comprehensive performance metrics, error analysis, and interpretable results to improve model understanding and outcomes. With support for plug-and-play metrics and extensible APIs, the Arthur Engine simplifies the process of understanding and optimizing generative AI outputs. The Arthur Engine can prevent data-security and compliance risks from creating negative or harmful experiences for your users in production or negatively impacting your organization's reputation.

Key Features:

  • Evaluate models on structured/tabular datasets with customizable metrics
  • Evaluate LLMs and generative AI workflows with customizable metrics
  • Support building real-time guardrails for LLM applications and agentic workflows
  • Trace and monitor model performance over time
  • Visualize feature importance and error breakdowns
  • Compare multiple models side-by-side
  • Extensible APIs for custom metric development or for using custom models
  • Integration with popular libraries like LangChain or LlamaIndex (coming soon!)

LLM Evaluations:

Eval Technique Source Docs
Hallucination Claim-based LLMJudge technique Source Docs
Prompt Injection Open Source: Using deberta-v3-base-prompt-injection-v2 Source Docs
Toxicity Open Source: Using roberta_toxicity_classifier Source Docs
Sensitive Data Few-shot optimized LLM Judge technique Source Docs
Personally Identifiable Information Using presidio based off Named-Entity recognition Source Docs
CustomRules Extend the service to support whatever monitoring or guardrails are applicable for your use-case Build your own! Docs

NB: We have provided open-source models for Prompt Injection and Toxicity evaluation as default in the free version of Arthur. In the case that you already have custom solutions for these evaluations and would like to use them, the models used for Prompt Injection and Toxicity are fully customizable and can be substituted out here (PI Code Pointer, Toxicity Code Pointer). If you are interested in higher performing and/or lower latency evaluations out of the box, please enquire about the enterprise version of Arthur Engine.

Deploying the stand-alone Guardrails

The Arthur Engine can be deployed as a stand-alone capability for building real-time guardrails for your LLM applications and agentic workflows. Read about the guardrails capability (also formerly known as Arthur Shield) here.

The guardrails capability requires the deployment of the GenAI Engine. Follow the genai-engine deployment instructions here.

Deploying the full capabilities of the Arthur Engine with the Arthur Platform

To use the Arthur Engine's full capabilities, log in to the Arthur Platform and follow the instructions.

Contributing

  • Join the Arthur community on Discord to get help and share your feedback.
  • To make a request for a bug fix or a new feature, please file a Github issue.
  • For making code contributions, please review the contributing guidelines.