The Arthur Engine is a tool designed for:
- Evaluating and Benchmarking Machine Learning models
- Support for a wide range of evaluation metrics (e.g., drift, accuracy, precision, recall, F1, and AUC)
- Tools for comparing models, exploring feature importance, and identifying areas for optimization
- For LLMs/GenAI applications, measure and monitor response relevance, hallucination rates, token counts, latency, and more
- Enforcing guardrails in your LLM Applications and Generative AI Workflows
- Configurable metrics for real-time detection of PII or Sensitive Data leakage, Hallucination, Prompt Injection attempts, Toxic language, and other quality metrics
- Extensibility to fit into your application's architecture
- Support for plug-and-play metrics and extensible API so you can bring your own custom-models or popular open-source models (inc. HuggingFace, etc.)
Quickstart - See Examples
- Clone the repository and
cd deployment/docker-compose/genai-engine
- Create
.env
file from.env.template
file and modify it (more instructions can be found in README on the current path) - Run
docker compose up
- Wait for the
genai-engine
container to initialize then navigate to localhost:3030/docs to see the API docs - Start building!.
The Arthur Engine provides a complete service for monitoring and governing your AI/ML workloads using popular Open-Source technologies and frameworks (see below for further details).
The enterprise version of the Arthur Platform provides better performance, additional features, and capabilities, including custom enterprise-ready guardrails + metrics, which can maximize the potential of AI for your organization.
Key features:
- State-of-the-art proprietary evaluation models trained by Arthur's world-class machine learning engineering team
- Airgapped deployment of the Arthur Engine (no dependency to Hugging Face Hub)
- Optional on-premises deployment of the entire Arthur Platform
- Support from the world-class engineering teams at Arthur
To learn more about the enterprise version of the Arthur Platform, reach out!
Performance Comparison between Free vs Enterprise version of Arthur Engine :
Enterprise version of Arthur Engine leverages state-of-the-art high-performing, low latency proprietary models for some of the LLM evaluations. Please see below for a detailed comparison between open-source vs enterprise performance.
Evaluation Type | Dataset | Free Version Performance (f1) | Enterprise Performance (f1) | Free Version Average Latency per Inference (s) | Enterprise Average Latency per Inference (s) |
---|---|---|---|---|---|
Prompt Injection | deepset | 0.52 (0.44, 0.60) | 0.89 (0.85, 0.93) | 0.966 | 0.03 |
Prompt Injection | Arthur’s Custom Benchmark | 0.79 (0.62, 0.93) | 0.85 (0.71, 0.96) | 0.16 | 0.005 |
Toxicity | Arthur’s Custom Benchmark | 0.633 (0.45, 0.79) | 0.89 (0.85, 0.93) | 3.096 | 0.0358 |
A free SaaS version of Arthur Platform is coming soon!
If you are interested in joining the waitlist or learning more, send us a note
The Arthur Engine is built with a focus on transparency and explainability, this framework provides users with comprehensive performance metrics, error analysis, and interpretable results to improve model understanding and outcomes. With support for plug-and-play metrics and extensible APIs, the Arthur Engine simplifies the process of understanding and optimizing generative AI outputs. The Arthur Engine can prevent data-security and compliance risks from creating negative or harmful experiences for your users in production or negatively impacting your organization's reputation.
Key Features:
- Evaluate models on structured/tabular datasets with customizable metrics
- Evaluate LLMs and generative AI workflows with customizable metrics
- Support building real-time guardrails for LLM applications and agentic workflows
- Trace and monitor model performance over time
- Visualize feature importance and error breakdowns
- Compare multiple models side-by-side
- Extensible APIs for custom metric development or for using custom models
- Integration with popular libraries like LangChain or LlamaIndex (coming soon!)
LLM Evaluations:
Eval | Technique | Source | Docs |
---|---|---|---|
Hallucination | Claim-based LLMJudge technique | Source | Docs |
Prompt Injection | Open Source: Using deberta-v3-base-prompt-injection-v2 | Source | Docs |
Toxicity | Open Source: Using roberta_toxicity_classifier | Source | Docs |
Sensitive Data | Few-shot optimized LLM Judge technique | Source | Docs |
Personally Identifiable Information | Using presidio based off Named-Entity recognition | Source | Docs |
CustomRules | Extend the service to support whatever monitoring or guardrails are applicable for your use-case | Build your own! | Docs |
NB: We have provided open-source models for Prompt Injection and Toxicity evaluation as default in the free version of Arthur. In the case that you already have custom solutions for these evaluations and would like to use them, the models used for Prompt Injection and Toxicity are fully customizable and can be substituted out here (PI Code Pointer, Toxicity Code Pointer). If you are interested in higher performing and/or lower latency evaluations out of the box, please enquire about the enterprise version of Arthur Engine.
The Arthur Engine can be deployed as a stand-alone capability for building real-time guardrails for your LLM applications and agentic workflows. Read about the guardrails capability (also formerly known as Arthur Shield) here.
The guardrails capability requires the deployment of the GenAI Engine. Follow the genai-engine
deployment instructions here.
To use the Arthur Engine's full capabilities, log in to the Arthur Platform and follow the instructions.
- Join the Arthur community on Discord to get help and share your feedback.
- To make a request for a bug fix or a new feature, please file a Github issue.
- For making code contributions, please review the contributing guidelines.