Evals & Logs

Updated by Santiago Cardona

Turn.io integrates with Maxim to give you real-time monitoring and evaluation of your AI-powered journeys. Once connected, every AI interaction in your journeys is automatically traced and sent to Maxim, where you can review logs, run evaluations, and improve your AI agents over time.

What are AI Logs?

AI Logs capture every interaction between your users and the AI blocks in your Journeys. Each log records what the user said, how the AI responded, and the underlying details like which model was used and how long it took. This gives you full visibility into how your AI agents are performing in production.

In Maxim, logs are organized as:

  • Traces — Individual AI interactions (a single question and response).
  • Sessions — Full multi-turn conversations that group related traces together.

What are AI Evaluations?

Evaluations let you automatically assess the quality of your AI responses. Instead of manually reading through every conversation, you can set up evaluators that score responses for things like bias, clarity, relevance, tone, or safety.

Maxim supports automated evaluations that run continuously on your logs, so you can catch quality issues early and track improvements over time. You can learn more about setting up evaluations in Maxim's evaluation guide.

Prerequisites

Before you begin, you'll need:

  1. A Maxim account. Sign up if you don't have one yet.
  2. Your Maxim API Key and Repository ID — both found in your Maxim dashboard.

Connect Maxim to Turn.io

Step 1: Sign up to Maxim

First, sign up for Maxim. Just hit the button below:

Step 2: Create a repository

Repositories are where Logs live. On the main navigation, click on Logs:

Then click Create a new Repository. The default settings are fine.

Step 3: Get your Repository ID

On the repository you just created, find the (...) button and click Copy ID. It looks something like this: cmkys49u1027tnsmqkw22tjqg.

Paste it somewhere handy; you'll need it later in this process.

Step 4: Get your API Key

You can find this in your Maxim dashboard under Settings.

Copy it somewhere handy; you'll need it in the next step.

Step 5: On Turn.io, Integrate with Maxim

Head over to Settings → AI in the sidebar, then click Evals at the top.

Now, paste the Repository ID and API key you copied earlier.

Step 6: Save

Once everything checks out, click Save.

And you're done! 🎉 From now on, Turn.io will automatically send AI logs to Maxim. If your journeys already use AI and have active users, you should start seeing logs appear in Maxim within a few minutes.

Edit or remove the integration

To edit your API key, open the Maxim integration modal and click the Edit button next to the masked key. Enter your new key and save.

To remove the integration entirely, open the modal and click Delete. This will stop sending AI traces to Maxim.

What happens next?

Once the integration is active, every AI interaction in your journeys — text generation, classification, AI agent conversations — is automatically logged to your Maxim repository.

From there, you can:

  • Browse logs to see exactly what your AI agents are saying to users.
  • Set up evaluations to automatically score response quality using LLM-as-a-Judge or human review.
  • Run simulations to test and iterate on your prompts before deploying changes.

For more on getting the most out of Maxim, visit the Maxim documentation.

Browse Logs

Once your Maxim integration is active, every AI interaction from your journeys is automatically exported as a log entry. Here's how to find, navigate, and make sense of your logs in Maxim.

Where to find your logs
  1. Open your Repository in the Maxim dashboard.
  2. Click the Logs tab. You'll see a table of all ingested AI interactions.

Each row in the table shows a summary of one trace:

  • Timestamp — When the AI interaction happened.
  • Name — The trace's name, set from your journey block (e.g. "AI Agent" or "Text Generation").
  • Input — The user's message or prompt.
  • Output — The AI's response.
  • Model — Which AI model was used (e.g. gpt-4o, claude-sonnet-4-5-20250514).
  • Tokens — Total tokens consumed (input + output).
  • Cost — Estimated cost in USD for the interaction.
  • Latency — How long the AI took to respond.
  • Tags — Metadata from Turn.io (journey name, block name, etc.).

Understand the log hierarchy

Turn.io organizes AI logs using a hierarchy that maps to how conversations flow through your journeys:

  • Session — A full conversation between a user and your AI agent. All interactions within the same journey session share a session ID.
  • Trace — A single AI "run" within that session. For example, if your AI agent block processes one user message, that's one trace. A multi-turn conversation produces multiple traces grouped under the same session.
  • Generation — The actual LLM call details: the messages sent to the model, the model's response, token counts, and parameters.
  • Span — Individual steps within a trace, such as individual LLM calls or tool executions. These are the building blocks of each trace.
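
This hierarchy can be sketched as nested data structures. Note that this is a hypothetical shape for illustration, not Maxim's actual API schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Generation:
    """The actual LLM call: model, response, and token count."""
    model: str    # e.g. "gpt-4o"
    output: str
    tokens: int

@dataclass
class Span:
    """One step within a trace, e.g. an LLM call or tool execution."""
    name: str
    generations: List[Generation] = field(default_factory=list)

@dataclass
class Trace:
    """A single AI run, named after the journey block."""
    name: str
    spans: List[Span] = field(default_factory=list)

@dataclass
class Session:
    """A full conversation; all its traces share one session ID."""
    session_id: str
    traces: List[Trace] = field(default_factory=list)

# One user message -> one trace; a multi-turn chat -> many traces per session
session = Session(
    session_id="sess_abc123",  # hypothetical ID, for illustration only
    traces=[
        Trace(name="AI Agent",
              spans=[Span(name="LLM call",
                          generations=[Generation("gpt-4o", "Hello!", 12)])]),
    ],
)
```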

Drill down to the generation details

The most useful information — the exact prompts and responses — is a few clicks deep. Here's how to get there:

  1. In the Logs tab, click on any row to open the trace detail panel.
  2. You'll see a breakdown of all spans within the trace. Look for the root span, which is named after your journey block (e.g. "AI Agent" or "Text Generation").
  3. Click on the root span to expand it. You'll see child spans representing individual events like LLM calls.
  4. In the span detail view, you can see the generation data: the full input messages (system prompt, conversation history, user message), the model's output, token usage, and latency.

This is where you can verify exactly what your AI agent received and how it responded — invaluable for debugging unexpected behavior.

Sessions: following multi-turn conversations

In WhatsApp, users rarely interact with your AI agent just once — they ask a question, get a response, follow up, clarify, and continue the conversation. Sessions in Maxim capture these multi-turn interactions as a single coherent unit, making it easy to review the full arc of a conversation rather than piecing together isolated traces.

How Turn.io maps conversations to sessions

Every AI interaction in a Turn.io journey runs within a journey session. Turn.io automatically exports this session ID to Maxim as session_id, which means:

  • All traces from the same journey run are grouped under the same session.
  • A user's back-and-forth with an AI agent block — every user message and every AI response — ends up as separate traces within one session.
  • When the user starts a new journey, a new session begins.

This mapping mirrors how conversations actually happen: one session = one continuous interaction between a user and your AI agent.
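
The grouping can be illustrated with a short sketch. Given exported trace records (the field names here are assumptions, not Maxim's actual export schema), traces that share a session_id fold into one conversation:

```python
from collections import defaultdict

# Hypothetical exported trace records; field names are assumptions.
traces = [
    {"session_id": "s1", "input": "Hi", "output": "Hello! How can I help?"},
    {"session_id": "s1", "input": "Track my order", "output": "Sure, what's the number?"},
    {"session_id": "s2", "input": "Opening hours?", "output": "We're open 9-5, Mon-Fri."},
]

def group_by_session(traces):
    """Group traces that share a session_id, preserving arrival order."""
    sessions = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace)
    return dict(sessions)

sessions = group_by_session(traces)
# "s1" holds the two-turn conversation; "s2" holds the single exchange
```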

Why sessions matter

Individual traces only tell you about one AI response in isolation. Sessions give you the full conversational context, which is essential when:

  • Debugging drift — An AI agent gradually loses track of what the user wants. You can only see this by looking at the full session, not a single trace.
  • Evaluating coherence — Did the AI stay on topic and remember earlier user inputs? Session-level evaluations answer this.
  • Measuring task success — Did the user actually accomplish their goal by the end of the conversation? That's a session-level question.
  • Reviewing user experience — When a user reports a problem, you want to read the entire conversation, not just the moment things went wrong.

Finding and reviewing sessions

To follow a specific conversation:

  1. In the Logs tab, click on the Sessions button.
  2. All traces of the same journey session will be grouped together in chronological order. Click on any session you want to explore.
  3. Then, click through each trace to see how the conversation evolved — user input, AI response, user input, AI response — along with full generation details at every step.

Filter logs to find what you need

With high-traffic journeys, you'll want to narrow down your logs. Maxim supports filtering using the tags that Turn.io attaches to every trace:

  • journey_name — The name of the journey that triggered the AI interaction. Use it to filter all AI activity for a specific journey.
  • journey_uuid — The unique ID of the journey (you can find it inside your Journey). Use it to precisely identify a journey, even if names change.
  • journey_block_name — The name of the specific AI block within the journey. Use it to compare performance across different AI blocks.
  • session_id — The conversation session ID. Use it to follow an entire user conversation across multiple traces.
  • context_type — The type of AI evaluation context. Use it to filter by production journeys or simulations.

To filter your logs:

  1. In the Logs tab, click the filter controls above the table.
  2. Select the tag you want to filter by (e.g. journey_name).
  3. Enter the value to match (e.g. "Customer Support Bot").
  4. You can combine multiple filters with AND/OR logic to narrow down further — for example, filter by journey_name AND a specific date range.
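
The same AND-style matching can be sketched client-side. This is a hypothetical helper over exported log records, not a Maxim API; the tag names mirror the ones Turn.io attaches:

```python
def filter_logs(logs, **tag_filters):
    """Keep logs whose tags match every given key/value pair (AND logic)."""
    return [
        log for log in logs
        if all(log.get("tags", {}).get(k) == v for k, v in tag_filters.items())
    ]

# Hypothetical log records, for illustration only
logs = [
    {"tags": {"journey_name": "Customer Support Bot", "journey_block_name": "AI Agent"}},
    {"tags": {"journey_name": "Sales Bot", "journey_block_name": "AI Agent"}},
]

support_logs = filter_logs(logs, journey_name="Customer Support Bot")
# Only the Customer Support Bot log survives the filter
```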

Common scenarios

"I want to compare how two AI blocks perform" Filter by journey_block_name to isolate logs for each block. Compare latency, token usage, and costs across them.

"I want to see all AI activity for a specific journey" Filter by journey_name or journey_uuid. This shows every AI interaction triggered by that journey, across all users and sessions.

"I want to monitor costs" Sort the Logs table by the Cost column to identify expensive traces. Check which models and journeys are consuming the most tokens.

Tips for working with logs

Use the Overview tab for trends. Before diving into individual logs, check the Overview tab in your repository. It shows aggregate metrics like total traces, latency graphs, and error rates over time — useful for spotting patterns before drilling into specifics.

Set up alerts for errors. In the Alerts tab, configure notifications (Slack, PagerDuty, or Opsgenie) for when error rates spike or latency exceeds a threshold. This way, you don't have to manually check logs every day.

Keep tags clean. The tags Turn.io sends are automatic — you don't need to configure them. But be aware that having descriptive journey and block names in Turn.io makes filtering in Maxim much easier. A block named "AI Agent" is harder to find than "Claims Processing Agent".

Set up Evaluations

Once your logs are flowing into Maxim, you can set up auto evaluations — automated quality checks that continuously score your AI responses as they come in. This means you don't have to manually review every conversation to spot issues.

You can add and configure different types of evaluators in the Evaluators section. Choose the ones you consider relevant for your use case.

Types of evaluators

Maxim offers three categories of evaluators:

  • AI evaluators (LLM-as-a-Judge) use a language model to assess your AI's responses. These are the most flexible and are great for nuanced quality checks:
    • Faithfulness — Is the response grounded in the provided context? Catches hallucinations.
    • Output Relevance — Does the response actually answer what the user asked?
    • Toxicity — Does the response contain harmful, offensive, or inappropriate content?
    • Clarity — Is the response easy to understand?
    • Conciseness — Is the response appropriately brief without losing important information?
    • Task Success — Did the AI accomplish the intended goal?
    • PII Detection — Does the response inadvertently expose personal information?
    • Bias — Does the response show unfair bias?
  • Programmatic evaluators use rule-based logic for deterministic checks:
    • Format validators (valid JSON, valid URL, valid email, etc.)
    • Pattern matching and content analysis (word counts, special characters)
  • Statistical evaluators compare outputs against expected results using metrics like cosine similarity, BLEU, and ROUGE scores. These are useful if you have reference answers to compare against.
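
As a minimal illustration of the statistical approach, here is a cosine similarity check over bag-of-words vectors. Real evaluators typically work over embeddings, so treat this as a simplified sketch:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts (0.0 to 1.0)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical texts score 1.0; texts with no shared words score 0.0
score = cosine_similarity("We are open 9 to 5", "We are open 9 to 5")
```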

How evaluations work

Evaluations run automatically on your logs based on rules you define. Each evaluator scores a specific quality dimension of your AI responses. You can combine multiple evaluators to get a comprehensive quality picture.

Maxim supports evaluations at different levels of granularity:

  • Trace-level — Evaluate individual AI responses. Best for checking if a single answer was accurate, relevant, or safe. This is the most common starting point.
  • Session-level — Evaluate entire multi-turn conversations. Useful for assessing overall conversation quality, coherence, and whether the user's goal was met.

Step-by-step: Set up auto evaluation on logs

  1. Navigate to the Logs section in Maxim and open your Repository.
  2. Click Manage evaluation in the top right corner.
  3. Click Add configuration and choose the evaluation level — Trace is the best starting point for most use cases.
  4. Select the evaluators you want to use.
  5. Map your variables — connect evaluator inputs to your log data. For example, map trace.output to the evaluator's "response" input so it knows what to score. For session-level evaluations, use trace[*].output to reference outputs across the full conversation.
  6. (Optional) Add filter rules to narrow which logs get evaluated. You can filter by model type, error status, tags, latency, token usage, and more. Combine multiple conditions with AND/OR logic.
  7. (Optional) Set a sampling rate to control costs. For example, evaluate 20% of traces instead of all of them — useful for high-traffic journeys.
  8. Click Save. New logs will be evaluated automatically as they arrive.
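
To build intuition for step 7, here is one way sampling could work in principle: hash each trace ID into a bucket and evaluate only the fraction below the rate. This is a hypothetical sketch, not how Maxim implements sampling internally:

```python
import hashlib

def should_evaluate(trace_id: str, sampling_rate: float = 0.2) -> bool:
    """Deterministically sample a fraction of traces by hashing their ID.

    The same trace always gets the same decision, keeping results
    reproducible across reruns. Hypothetical helper, for illustration.
    """
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash into [0, 1]
    return bucket < sampling_rate

# With a 0.2 rate, roughly a fifth of 10,000 traces get selected
sampled = sum(should_evaluate(f"trace-{i}") for i in range(10_000))
```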

Reviewing evaluation results

Open any trace in your Maxim logs and click the Evaluation tab. You'll see scores from each evaluator, along with explanations for why the score was given (for AI evaluators).

Tips for effective evaluations

Start small, then expand. Begin with 2–3 evaluators that address your biggest concerns — typically Faithfulness, Output Relevance, and Toxicity. Add more once you're comfortable reading the results.

Use sampling for high-volume journeys. If your journey handles thousands of conversations daily, evaluating every single trace gets expensive. A 10–20% sample rate still gives you strong statistical confidence while keeping costs manageable.

Combine AI and programmatic evaluators. AI evaluators are great for subjective quality, but programmatic evaluators give you deterministic guarantees. For example, if your AI agent returns JSON, add an isValidJSON evaluator alongside your quality checks.
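
A JSON-validity check of that kind is simple enough to sketch directly. This is an illustrative equivalent, not Maxim's evaluator implementation:

```python
import json

def is_valid_json(output: str) -> bool:
    """Deterministic check: does the AI's output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

ok = is_valid_json('{"status": "approved", "claim_id": 42}')   # well-formed
bad = is_valid_json('{status: approved}')                       # unquoted keys
```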

Filter out noise. Use filter rules to skip evaluating error traces or traces from test users. This keeps your quality metrics clean and focused on real user interactions.

Set up alerts. Once your evaluations are running, configure alerts in Maxim to notify you when quality scores drop below a threshold. This turns evaluations from a passive dashboard into an active monitoring system.

For the full reference on evaluation configuration, see Maxim's auto evaluation guide.

Simulating conversations

While you're developing or iterating on a prompt, it's helpful to simulate conversations with AI before shipping changes to your users, and then use Evals to check whether your changes had a positive (or negative) impact.

To do that, Maxim also provides a simulation feature. You can see how to set it up here: Simulate & Test AI Journeys.
