RTS Experiment: Testing Context Adherence Across 10 Cloud & Local Models

Published: June 11, 2026

Written by Alina Enikeeva

TL;DR

We tested ten LLMs (four local open-source models and six cloud-based models) to see how they handled a “refusal question” about a topic absent from the papers provided as context.

Key Takeaways:

Reliable Refusals: 9 out of 10 models correctly identified that the provided papers did not cover music generation. The models either gave a “clean” refusal or noted a passing reference to the related field of “audio” without hallucinating extra content.
The Hallucination Outlier: Despite being a frontier model, Gemini 2.5 Pro was the only participant to fail. It hallucinated that the papers discussed music generation as a successful application, even though the topic was entirely absent from the text.
Local Precision: Small, local models like Gemma 4 and Mistral Small demonstrated high “refusal integrity,” strictly adhering to the provided context and outperforming a much larger cloud competitor in accuracy for this specific task.

Conclusion: Model size and “frontier” status do not guarantee RAG (Retrieval-Augmented Generation) reliability. As the experiment shows, even top-tier cloud models can hallucinate coverage that doesn’t exist, making rigorous testing essential for any context-dependent application.

After our first experiment on document extraction, we ran a different kind of test on a different corpus. This one is a refusal question. We feed the models a question whose answer isn’t actually in the retrieved chunks and look at what each output contains. The right answer is “the papers don’t cover this.” Here’s how that went.

The setup

Ten models in this run. Four are open-source models running locally on the same DGX Spark as last time, including Gemma 4 26B and Mistral Small.

Six cloud models from public APIs: Claude Opus 4.7, GPT-5, GPT-5.4, Gemini 2.5 Pro, Gemini 3 Pro Preview, and Gemini 3.1 Pro Preview.

The corpus is 48 scientific papers from arXiv published in April 2026, after the training cutoff for every model in the lineup. We split the papers into chunks and retrieve 20 chunks per question.
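The post doesn’t say how the papers were chunked or which retriever was used. As a rough sketch only, assuming fixed-size word windows and a bag-of-words cosine score (both stand-ins, not the pipeline used in this experiment), retrieving the top 20 chunks for a question might look like this. The names chunk_text, cosine, and retrieve, along with the window sizes, are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words (assumed scheme)."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def _vector(text: str) -> Counter:
    """Lowercased bag-of-words counts, a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 20) -> list[str]:
    """Return the k chunks most similar to the question, best first."""
    q = _vector(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, _vector(c)), reverse=True)
    return ranked[:k]
```

Note that a top-k retriever like this returns k chunks whether or not any of them answer the question. That is what makes a refusal question possible: the model always receives context, even when the context is off-topic.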

The question

“Based only on the retrieved context, what do these papers say about applying diffusion models to music generation?”

The retrieved chunks for this prompt cover diffusion models, but in three other domains: molecular structures and protein design, from a paper called Quotient-Space Diffusion Models; robotics policies, from a paper called From Noise to Intent; and quantum trajectories, from a paper about Feedback Hamiltonians. Music generation does not appear anywhere. The closest mention is one passing reference: “Building on their success in real-world domains such as images, audios, and videos…” That is the entire mention of audio in the chunks.

The outputs

Nine of ten outputs were refusals. They opened with some version of “the papers don’t discuss music generation.” The exact phrasings varied, and so did how much related context each output included after the refusal.

Four were clean, short refusals. From Gemma 4 26B: “Based on the provided documents, there is no information regarding the application of diffusion models to music generation.”

Five included a refusal plus a mention of audio as a related field. Each labeled audio as related rather than equivalent and explicitly said music wasn’t in the papers. From Gemini 3 Pro Preview: “the papers do not specifically mention music generation. However, they do mention the closely related field of audio generation.”

Both forms of response are consistent with the topic not being in the chunks.
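The grouping above was done by reading the outputs. Purely as an illustration, a crude keyword heuristic could sort responses into the same buckets. The labels, patterns, and function name below are hypothetical and tuned to the phrasings quoted in this post; this is not how the outputs were graded.

```python
import re

# Hypothetical patterns, tuned to the refusal phrasings quoted in this post.
REFUSAL = re.compile(
    r"\b(no information"
    r"|do(?:es)? not (?:specifically )?(?:mention|discuss|cover)"
    r"|don[’']t (?:mention|discuss|cover))\b",
    re.IGNORECASE,
)
# The related field that several refusals pointed to.
RELATED = re.compile(r"\baudio\b", re.IGNORECASE)


def classify(output: str) -> str:
    """Label an output as a clean refusal, a refusal that names a related field, or neither."""
    if not REFUSAL.search(output):
        return "no_refusal"
    if RELATED.search(output):
        return "refusal_with_related_field"
    return "clean_refusal"


gemma = (
    "Based on the provided documents, there is no information regarding "
    "the application of diffusion models to music generation."
)
gemini_3 = (
    "the papers do not specifically mention music generation. However, "
    "they do mention the closely related field of audio generation."
)

print(classify(gemma))     # clean_refusal
print(classify(gemini_3))  # refusal_with_related_field
```

A heuristic this simple is easy to fool. An output that says the papers “do not mention” one thing and then invents coverage of another would still be labeled a refusal, so anything it flags or clears still needs a human read.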

The exception

The one output that wasn’t a refusal came from Gemini 2.5 Pro. Its opening sentence:

“Based on the provided context, the papers mention the application of diffusion models to audio and music generation as an example of a domain where these models have been successful.”

The retrieved chunks do not say this. The chunks mention “audios” exactly once, in passing, with no music content anywhere. There is no discussion of music generation in any of the papers in the retrieved set. Gemini 2.5 Pro’s output described coverage that wasn’t there.
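A claim like “mentioned exactly once” can be checked mechanically by counting how often a term appears across the retrieved chunks. The sketch below is one assumed way to do that, not the check used for this post. The function name is hypothetical, and the chunk list holds only the single sentence quoted earlier, since the other chunks aren’t reproduced here.

```python
import re


def count_mentions(chunks: list[str], stem: str) -> int:
    """Count words across all chunks that start with `stem`, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(stem)}\w*", re.IGNORECASE)
    return sum(len(pattern.findall(chunk)) for chunk in chunks)


# Only the sentence quoted in this post; the full set of 20 chunks is not shown.
chunks = [
    "Building on their success in real-world domains such as images, audios, and videos…",
]

print(count_mentions(chunks, "audio"))  # 1
print(count_mentions(chunks, "music"))  # 0
```

Stem matching also counts words like “musical” or “musician”, so a zero is informative while a nonzero count still calls for reading the surrounding text.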

What we’re not concluding

One question. Refusal behavior depends on training, prompting, and where the boundary between “the context discusses X” and “the context discusses something adjacent to X” falls for a particular model. A different refusal question with different content might produce a different distribution of outputs.

What we found worth showing: the output that described coverage that wasn’t there didn’t come from the smallest model in the lineup. It came from a frontier cloud model. The four open-source local models all returned refusals. The newer Gemini variants in the same lineup also returned refusals with the same audio framing as the rest. The unsupported description came from one specific model on this specific question, not a class of models.

Alina Enikeeva

AI Solutions Data Engineer @ RTS Labs

Alina Enikeeva is an AI Solutions Data Engineer at RTS Labs, where she builds custom AI and data engineering solutions for enterprise clients. She holds a B.S. in Computer Science and Psychology from the University of Richmond, and her background spans machine learning, high-performance computing, and applied data science.
