The Arthur Engine

Make AI work for Everyone.

Website - Documentation - Talk to someone at Arthur

The Arthur Engine

The Arthur Engine is a tool designed for:

Evaluating and Benchmarking Machine Learning models
- Support for a wide range of evaluation metrics (e.g., drift, accuracy, precision, recall, F1, and AUC)
- Tools for comparing models, exploring feature importance, and identifying areas for optimization
- For LLMs/GenAI applications, measure and monitor response relevance, hallucination rates, token counts, latency, and more
Enforcing guardrails in your LLM Applications and Generative AI Workflows
- Configurable metrics for real-time detection of PII or Sensitive Data leakage, Hallucination, Prompt Injection attempts, Toxic language, and other quality metrics
Extensibility to fit into your application's architecture
- Support for plug-and-play metrics and extensible API so you can bring your own custom-models or popular open-source models (inc. HuggingFace, etc.)

Quickstart - See Examples

Clone the repository and cd deployment/docker-compose/genai-engine
Create .env file from .env.template file and modify it (more instructions can be found in README on the current path)
Run docker compose up
Wait for the genai-engine container to initialize then navigate to localhost:3030/docs to see the API docs
Start building!.

Arthur Platform Enterprise Version

The Arthur Engine provides a complete service for monitoring and governing your AI/ML workloads using popular Open-Source technologies and frameworks (see below for further details).

The enterprise version of the Arthur Platform provides better performance, additional features, and capabilities, including custom enterprise-ready guardrails + metrics, which can maximize the potential of AI for your organization.

Key features:

State-of-the-art proprietary evaluation models trained by Arthur's world-class machine learning engineering team
Airgapped deployment of the Arthur Engine (no dependency to Hugging Face Hub)
Optional on-premises deployment of the entire Arthur Platform
Support from the world-class engineering teams at Arthur

To learn more about the enterprise version of the Arthur Platform, reach out!

Performance Comparison between Free vs Enterprise version of Arthur Engine :

Enterprise version of Arthur Engine leverages state-of-the-art high-performing, low latency proprietary models for some of the LLM evaluations. Please see below for a detailed comparison between open-source vs enterprise performance.

Evaluation Type	Dataset	Free Version Performance (f1)	Enterprise Performance (f1)	Free Version Average Latency per Inference (s)	Enterprise Average Latency per Inference (s)
Prompt Injection	deepset	0.52 (0.44, 0.60)	0.89 (0.85, 0.93)	0.966	0.03
Prompt Injection	Arthur’s Custom Benchmark	0.79 (0.62, 0.93)	0.85 (0.71, 0.96)	0.16	0.005
Toxicity	Arthur’s Custom Benchmark	0.633 (0.45, 0.79)	0.89 (0.85, 0.93)	3.096	0.0358

A free SaaS version of Arthur Platform is coming soon!

If you are interested in joining the waitlist or learning more, send us a note

Overview

The Arthur Engine is built with a focus on transparency and explainability, this framework provides users with comprehensive performance metrics, error analysis, and interpretable results to improve model understanding and outcomes. With support for plug-and-play metrics and extensible APIs, the Arthur Engine simplifies the process of understanding and optimizing generative AI outputs. The Arthur Engine can prevent data-security and compliance risks from creating negative or harmful experiences for your users in production or negatively impacting your organization's reputation.

Key Features:

Evaluate models on structured/tabular datasets with customizable metrics
Evaluate LLMs and generative AI workflows with customizable metrics
Support building real-time guardrails for LLM applications and agentic workflows
Trace and monitor model performance over time
Visualize feature importance and error breakdowns
Compare multiple models side-by-side
Extensible APIs for custom metric development or for using custom models
Integration with popular libraries like LangChain or LlamaIndex (coming soon!)

LLM Evaluations:

Eval	Technique	Source	Docs
Hallucination	Claim-based LLMJudge technique	Source	Docs
Prompt Injection	Open Source: Using deberta-v3-base-prompt-injection-v2	Source	Docs
Toxicity	Open Source: Using roberta_toxicity_classifier	Source	Docs
Sensitive Data	Few-shot optimized LLM Judge technique	Source	Docs
Personally Identifiable Information	Using presidio based off Named-Entity recognition	Source	Docs
CustomRules	Extend the service to support whatever monitoring or guardrails are applicable for your use-case	Build your own!	Docs

NB: We have provided open-source models for Prompt Injection and Toxicity evaluation as default in the free version of Arthur. In the case that you already have custom solutions for these evaluations and would like to use them, the models used for Prompt Injection and Toxicity are fully customizable and can be substituted out here (PI Code Pointer, Toxicity Code Pointer). If you are interested in higher performing and/or lower latency evaluations out of the box, please enquire about the enterprise version of Arthur Engine.

Deploying the stand-alone Guardrails

The Arthur Engine can be deployed as a stand-alone capability for building real-time guardrails for your LLM applications and agentic workflows. Read about the guardrails capability (also formerly known as Arthur Shield) here.

The guardrails capability requires the deployment of the GenAI Engine. Follow the genai-engine deployment instructions here.

Deploying the full capabilities of the Arthur Engine with the Arthur Platform

To use the Arthur Engine's full capabilities, log in to the Arthur Platform and follow the instructions.

Contributing

Join the Arthur community on Discord to get help and share your feedback.
To make a request for a bug fix or a new feature, please file a Github issue.
For making code contributions, please review the contributing guidelines.

Name	Name	Last commit message	Last commit date
Latest commit videetparekh Patch changes to trace_endpoint (#152 ) May 14, 2025 28cf07c · May 14, 2025 History 382 Commits
.github/workflows	.github/workflows	fix: set telemetry flag from inside the dockerfile based on a build a…	May 2, 2025
deployment	deployment	Increment version to 2.1.41	May 13, 2025
docs/images	docs/images	Make sure the Arthur logo is dark mode compatible on the Github READM…	Apr 14, 2025
genai-engine	genai-engine	Merge branch 'dev' into trace_endpoint_refactor	May 14, 2025
ml-engine	ml-engine	Move all Docker Compose deployment scripts under the 'deployment' folder	Apr 22, 2025
.gitignore	.gitignore	Apply standards	Apr 19, 2025
.pre-commit-config.yaml	.pre-commit-config.yaml	Consolidate deployment scripts to improve their discoverability	Apr 22, 2025
CONTRIBUTING.md	CONTRIBUTING.md	Renaming the file for conforming to the community standards	Apr 23, 2025
LICENSE	LICENSE	Initial commit for Arthur Engine OSS	Mar 26, 2025
README.md	README.md	Make the getting started doc easier to find	May 8, 2025
renovate.json	renovate.json	Add renovate.json	May 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Arthur Engine

Quickstart - See Examples

Arthur Platform Enterprise Version

Overview

Deploying the stand-alone Guardrails

Deploying the full capabilities of the Arthur Engine with the Arthur Platform

Contributing

About

Releases 5

Packages 4

Contributors 8

Languages

License

arthur-ai/arthur-engine

Folders and files

Latest commit

History

Repository files navigation

The Arthur Engine

Quickstart - See Examples

Arthur Platform Enterprise Version

Overview

Deploying the stand-alone Guardrails

Deploying the full capabilities of the Arthur Engine with the Arthur Platform

Contributing

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 5

Packages 4

Contributors 8

Languages