Evals & Logs
By Santiago Cardona
Turn.io integrates with Maxim to give you real-time monitoring and evaluation of your AI-powered journeys. Once connected, every AI interaction in your journeys is automatically traced and sent to Maxim, where you can review logs, run evaluations, and improve your AI agents over time.
What are AI Logs?
AI Logs capture every interaction between your users and the AI blocks in your Journeys. Each log records what the user said, how the AI responded, and the underlying details like which model was used and how long it took. This gives you full visibility into how your AI agents are performing in production.

In Maxim, logs are organized as:
- Traces — Individual AI interactions (a single question and response).
- Sessions — Full multi-turn conversations that group related traces together.
What are AI Evaluations?
Evaluations let you automatically assess the quality of your AI responses. Instead of manually reading through every conversation, you can set up evaluators that score responses for things like bias, clarity, relevance, tone, or safety.

Maxim supports automated evaluations that run continuously on your logs, so you can catch quality issues early and track improvements over time. You can learn more about setting up evaluations in Maxim's evaluation guide.
Prerequisites
Before you begin, you'll need:
- A Maxim account. Sign up if you don't have one yet.
- Your Maxim API Key and Repository ID — both found in your Maxim dashboard.
Connect Maxim to Turn.io
Step 1: Sign up for Maxim
First, sign up for Maxim using the button below:
Step 2: Create a repository
Repositories are where logs live. In the main navigation, click Logs:

Then click Create a new Repository. The default settings are fine.
Step 3: Get your Repository ID
On the repository you just created, open the (...) menu and click Copy ID. The ID looks something like this: cmkys49u1027tnsmqkw22tjqg.
Paste it somewhere safe; you'll need it later in this process.

Step 4: Get your API Key
You can find your API key in the Maxim dashboard under Settings.
Paste it somewhere safe; you'll need it in the next step.

Step 5: On Turn.io, Integrate with Maxim
Head over to Settings → AI in the sidebar, then click Evals at the top.

Now, paste the Repository ID and API key you copied earlier.
Step 6: Save
Once everything checks out, click Save.
And you're done! 🎉 From now on, Turn.io will automatically send AI logs to Maxim. If your journeys already use AI and are receiving traffic, logs should start appearing in Maxim within a few minutes.
Edit or remove the integration
To edit your API key, open the Maxim integration modal and click the Edit button next to the masked key. Enter your new key and save.
To remove the integration entirely, open the modal and click Delete. Turn.io will stop sending AI traces to Maxim.
What happens next?
Once the integration is active, every AI interaction in your journeys — text generation, classification, AI agent conversations — is automatically logged to your Maxim repository.
From there, you can:
- Browse logs to see exactly what your AI agents are saying to users.
- Set up evaluations to automatically score response quality using LLM-as-a-Judge or human review.
- Run simulations to test and iterate on your prompts before deploying changes.
For more on getting the most out of Maxim, visit the Maxim documentation.
Browse Logs
Once your Maxim integration is active, every AI interaction from your journeys is automatically exported as a log entry. Here's how to find, navigate, and make sense of your logs in Maxim.
Where to find your logs
- Open your Repository in the Maxim dashboard.
- Click the Logs tab. You'll see a table of all ingested AI interactions.

Each row in the table shows a summary of one trace:
| Column | What it shows |
| --- | --- |
| Timestamp | When the AI interaction happened |
| Name | The trace's name, taken from your journey block (e.g. "AI Agent" or "Text Generation") |
| Input | The user's message or prompt |
| Output | The AI's response |
| Model | Which AI model was used |
| Tokens | Total tokens consumed (input + output) |
| Cost | Estimated cost in USD for the interaction |
| Latency | How long the AI took to respond |
| Tags | Metadata from Turn.io (journey name, block name, etc.) |
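As a quick sanity check on the Tokens and Cost columns, here's the arithmetic behind a cost estimate. The per-token prices below are hypothetical placeholders, not Maxim's or any provider's actual rates:

```python
# Hypothetical per-million-token prices; real rates depend on the model
# and provider. These numbers are placeholders for illustration only.
PRICE_PER_M_INPUT = 3.00    # USD per 1M input tokens (made up)
PRICE_PER_M_OUTPUT = 15.00  # USD per 1M output tokens (made up)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one AI interaction from its token counts."""
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

# A trace that consumed 1,200 input tokens and 300 output tokens:
print(f"${estimate_cost(1_200, 300):.4f}")  # -> $0.0081
```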
Understand the log hierarchy
Turn.io organizes AI logs using a hierarchy that maps to how conversations flow through your journeys:
- Session — A full conversation between a user and your AI agent. All interactions within the same journey session share a session ID.
- Trace — A single AI "run" within that session. For example, if your AI agent block processes one user message, that's one trace. A multi-turn conversation produces multiple traces grouped under the same session.
- Generation — The actual LLM call details: the messages sent to the model, the model's response, token counts, and parameters.
- Span — Individual steps within a trace, such as individual LLM calls or tool executions. These are the building blocks of each trace.
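As a mental model, the hierarchy nests like the sketch below. The field names are illustrative, not Maxim's exact schema:

```python
# Illustrative shape of the log hierarchy; field names are not Maxim's exact schema.
session = {
    "session_id": "abc123",            # one full user conversation
    "traces": [                        # one trace per AI "run" in the journey
        {
            "name": "AI Agent",        # named after the journey block
            "spans": [                 # individual steps within the trace
                {
                    "type": "generation",       # the actual LLM call
                    "input_messages": ["..."],  # system prompt, history, user message
                    "output": "...",            # the model's response
                    "tokens": {"input": 1200, "output": 300},
                },
            ],
        },
        # ...more traces as the conversation continues
    ],
}
```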

Drill down to the generation details
The most useful information — the exact prompts and responses — is a few clicks deep. Here's how to get there:
- In the Logs tab, click on any row to open the trace detail panel.
- You'll see a breakdown of all spans within the trace. Look for the root span, which is named after your journey block (e.g. "AI Agent" or "Text Generation").
- Click on the root span to expand it. You'll see child spans representing individual events like LLM calls.
- In the span detail view, you can see the generation data: the full input messages (system prompt, conversation history, user message), the model's output, token usage, and latency.

This is where you can verify exactly what your AI agent received and how it responded — invaluable for debugging unexpected behavior.
Sessions: following multi-turn conversations
In WhatsApp, users rarely interact with your AI agent just once — they ask a question, get a response, follow up, clarify, and continue the conversation. Sessions in Maxim capture these multi-turn interactions as a single coherent unit, making it easy to review the full arc of a conversation rather than piecing together isolated traces.

How Turn.io maps conversations to sessions
Every AI interaction in a Turn.io journey runs within a journey session. Turn.io automatically exports this session ID to Maxim as session_id, which means:
- All traces from the same journey run are grouped under the same session.
- A user's back-and-forth with an AI agent block — every user message and every AI response — ends up as separate traces within one session.
- When the user starts a new journey, a new session begins.
This mapping mirrors how conversations actually happen: one session = one continuous interaction between a user and your AI agent.
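To make the mapping concrete, here's a minimal sketch that groups trace records by their session_id, the way Maxim groups them into sessions. The records themselves are made up:

```python
from collections import defaultdict

# Made-up trace records; the only field taken from this guide is session_id,
# which Turn.io attaches to every trace it exports.
traces = [
    {"session_id": "s-1", "input": "What are your opening hours?", "output": "We open at 9am."},
    {"session_id": "s-1", "input": "And on weekends?", "output": "10am on Saturdays."},
    {"session_id": "s-2", "input": "How do I reset my password?", "output": "Tap 'Forgot password'."},
]

# Group traces into sessions: one session = one continuous conversation.
sessions: dict[str, list[dict]] = defaultdict(list)
for trace in traces:
    sessions[trace["session_id"]].append(trace)

for session_id, turns in sessions.items():
    print(session_id, "->", len(turns), "turn(s)")
# s-1 -> 2 turn(s)
# s-2 -> 1 turn(s)
```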
Why sessions matter
Individual traces only tell you about one AI response in isolation. Sessions give you the full conversational context, which is essential when:
- Debugging drift — An AI agent gradually loses track of what the user wants. You can only see this by looking at the full session, not a single trace.
- Evaluating coherence — Did the AI stay on topic and remember earlier user inputs? Session-level evaluations answer this.
- Measuring task success — Did the user actually accomplish their goal by the end of the conversation? That's a session-level question.
- Reviewing user experience — When a user reports a problem, you want to read the entire conversation, not just the moment things went wrong.
Finding and reviewing sessions
To follow a specific conversation:
- In the Logs tab, click on the Sessions button.
- All traces of the same journey session will be grouped together in chronological order. Click on any session you want to explore.
- Then, click through each trace to see how the conversation evolved — user input, AI response, user input, AI response — along with full generation details at every step.

Filter logs to find what you need
With high-traffic journeys, you'll want to narrow down your logs. Maxim supports filtering using the tags that Turn.io attaches to every trace:
| Tag | What it contains | Use it to... |
| --- | --- | --- |
| journey_name | The name of the journey that triggered the AI interaction | Filter all AI activity for a specific journey |
| journey_uuid | The unique ID of the journey. You can get this inside your Journey. | Precisely identify a journey (useful if names change) |
| journey_block_name | The name of the specific AI block within the journey | Compare performance across different AI blocks |
| session_id | The conversation session ID | Follow an entire user conversation across multiple traces |
| | The type of AI evaluation context | Filter by production journeys or simulations |
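Put together, the tags on a single trace might look like the sketch below. The keys come from the table above; the values are invented examples:

```python
# Example tags on a single trace. Keys come from the table above;
# the values are invented for illustration.
tags = {
    "journey_name": "Customer Support Bot",
    "journey_uuid": "0f3a9c...",          # stable even if the journey is renamed
    "journey_block_name": "Claims Processing Agent",
    "session_id": "s-1",                  # shared by every trace in one conversation
}
```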
To filter your logs:
- In the Logs tab, click the filter controls above the table.
- Select the tag you want to filter by (e.g. journey_name).
- Enter the value to match (e.g. "Customer Support Bot").
- You can combine multiple filters with AND/OR logic to narrow down further — for example, filter by journey_name and a specific date range, as in the sketch below.
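Conceptually, combined filters behave like the plain-Python sketch below. This illustrates the AND logic, not Maxim's actual query syntax:

```python
from datetime import date

def matches(trace: dict, journey_name: str, start: date, end: date) -> bool:
    """AND two conditions: a journey_name tag match and a date range."""
    return (
        trace["tags"]["journey_name"] == journey_name  # condition 1: tag match
        and start <= trace["date"] <= end              # condition 2: date range
    )

trace = {"tags": {"journey_name": "Customer Support Bot"}, "date": date(2025, 6, 10)}
print(matches(trace, "Customer Support Bot", date(2025, 6, 1), date(2025, 6, 30)))  # True
```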

Common scenarios
"I want to compare how two AI blocks perform" Filter by journey_block_name to isolate logs for each block. Compare latency, token usage, and costs across them.
"I want to see all AI activity for a specific journey" Filter by journey_name or journey_uuid. This shows every AI interaction triggered by that journey, across all users and sessions.
"I want to monitor costs" Sort the Logs table by the Cost column to identify expensive traces. Check which models and journeys are consuming the most tokens.
Tips for working with logs
Use the Overview tab for trends. Before diving into individual logs, check the Overview tab in your repository. It shows aggregate metrics like total traces, latency graphs, and error rates over time — useful for spotting patterns before drilling into specifics.
Set up alerts for errors. In the Alerts tab, configure notifications (Slack, PagerDuty, or Opsgenie) for when error rates spike or latency exceeds a threshold. This way, you don't have to manually check logs every day.
Keep tags clean. The tags Turn.io sends are automatic — you don't need to configure them. But be aware that having descriptive journey and block names in Turn.io makes filtering in Maxim much easier. A block named "AI Agent" is harder to find than "Claims Processing Agent".
Set up Evaluations
Once your logs are flowing into Maxim, you can set up auto evaluations — automated quality checks that continuously score your AI responses as they come in. This means you don't have to manually review every conversation to spot issues.
You can add and configure different types of evaluators in the Evaluators section. Choose the ones you consider relevant for your use case.

Types of evaluators
Maxim offers three categories of evaluators:
- AI evaluators (LLM-as-a-Judge) use a language model to assess your AI's responses. These are the most flexible and are great for nuanced quality checks:
  - Faithfulness — Is the response grounded in the provided context? Catches hallucinations.
  - Output Relevance — Does the response actually answer what the user asked?
  - Toxicity — Does the response contain harmful, offensive, or inappropriate content?
  - Clarity — Is the response easy to understand?
  - Conciseness — Is the response appropriately brief without losing important information?
  - Task Success — Did the AI accomplish the intended goal?
  - PII Detection — Does the response inadvertently expose personal information?
  - Bias — Does the response show unfair bias?
- Programmatic evaluators use rule-based logic for deterministic checks:
  - Format validators (valid JSON, valid URL, valid email, etc.)
  - Pattern matching and content analysis (word counts, special characters)
- Statistical evaluators compare outputs against expected results using metrics like cosine similarity, BLEU, and ROUGE scores. These are useful if you have reference answers to compare against. The sketch below illustrates the last two categories.
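To give a feel for what those two categories compute, here are minimal sketches of a format validator and a cosine-similarity score. These illustrate the ideas, not Maxim's implementations:

```python
import json
import math

def is_valid_json(output: str) -> bool:
    """Programmatic evaluator: a deterministic pass/fail on output format."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Statistical evaluator: compare an output embedding against a reference."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(is_valid_json('{"status": "ok"}'))            # True
print(is_valid_json("not json"))                    # False
print(round(cosine_similarity([1, 0], [1, 1]), 3))  # 0.707
```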
How evaluations work
Evaluations run automatically on your logs based on rules you define. Each evaluator scores a specific quality dimension of your AI responses. You can combine multiple evaluators to get a comprehensive quality picture.

Maxim supports evaluations at different levels of granularity:
- Trace-level — Evaluate individual AI responses. Best for checking if a single answer was accurate, relevant, or safe. This is the most common starting point.
- Session-level — Evaluate entire multi-turn conversations. Useful for assessing overall conversation quality, coherence, and whether the user's goal was met.
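The difference in granularity shows up in what an evaluator receives. A rough sketch of the two levels, assuming each trace record carries an output field:

```python
# Illustrative, not Maxim's internal format: what an evaluator sees at each level.
session_traces = [
    {"output": "We open at 9am."},
    {"output": "10am on Saturdays."},
]

# Trace-level: the evaluator scores one response in isolation.
trace_level_input = session_traces[0]["output"]

# Session-level: the evaluator sees every response in the conversation,
# which is what the trace[*].output mapping expresses.
session_level_input = [t["output"] for t in session_traces]
```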
Step-by-step: Set up auto evaluation on logs
- Navigate to the Logs section in Maxim and open your Repository.
- Click Manage evaluation in the top right corner.
- Click Add configuration and choose the evaluation level — Trace is the best starting point for most use cases.
- Select the evaluators you want to use.
- Map your variables — connect evaluator inputs to your log data. For example, map trace.output to the evaluator's "response" input so it knows what to score. For session-level evaluations, use trace[*].output to reference outputs across the full conversation.
- (Optional) Add filter rules to narrow which logs get evaluated. You can filter by model type, error status, tags, latency, token usage, and more. Combine multiple conditions with AND/OR logic.
- (Optional) Set a sampling rate to control costs. For example, evaluate 20% of traces instead of all of them — useful for high-traffic journeys (see the sketch after these steps).
- Click Save. New logs will be evaluated automatically as they arrive.
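As a back-of-the-envelope, a sampling rate is just an independent per-trace coin flip. The sketch below shows the decision a 20% rate makes; it's plain Python, not Maxim's configuration format:

```python
import random

SAMPLING_RATE = 0.20  # evaluate roughly 1 in 5 traces

def should_evaluate() -> bool:
    """Decide independently per trace whether it gets evaluated."""
    return random.random() < SAMPLING_RATE

# Over 1,000 daily traces, expect roughly 200 evaluations:
evaluated = sum(should_evaluate() for _ in range(1_000))
print(evaluated)  # ~200, varies run to run
```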
Reviewing evaluation results
Open any trace in your Maxim logs and click the Evaluation tab. You'll see scores from each evaluator, along with explanations for why the score was given (for AI evaluators).

Tips for effective evaluations
Start small, then expand. Begin with 2–3 evaluators that address your biggest concerns — typically Faithfulness, Output Relevance, and Toxicity. Add more once you're comfortable reading the results.
Use sampling for high-volume journeys. If your journey handles thousands of conversations daily, evaluating every single trace gets expensive. A 10–20% sample rate still gives you strong statistical confidence while keeping costs manageable.
Combine AI and programmatic evaluators. AI evaluators are great for subjective quality, but programmatic evaluators give you deterministic guarantees. For example, if your AI agent returns JSON, add an isValidJSON evaluator alongside your quality checks.
Filter out noise. Use filter rules to skip evaluating error traces or traces from test users. This keeps your quality metrics clean and focused on real user interactions.
Set up alerts. Once your evaluations are running, configure alerts in Maxim to notify you when quality scores drop below a threshold. This turns evaluations from a passive dashboard into an active monitoring system.
For the full reference on evaluation configuration, see Maxim's auto evaluation guide.
Simulating conversations
While you're developing or iterating on a prompt, it's helpful to simulate conversations with AI before shipping changes to your users — and then use Evals to check whether the changes had a positive (or negative) impact.
To do that, Maxim also provides a simulation feature. You can see how to set it up here: Simulate & Test AI Journeys.
