Think about the last time someone on your team needed a serious answer fast. Not a document link. Not a keyword match. An actual answer, with sources, context, and enough depth to make a decision from. That request probably landed on a person, took hours, and still came back incomplete.
NVIDIA has released a framework called AI-Q that is designed to close exactly that gap. It is not a product you subscribe to. It is an open source blueprint that your technical teams can set up, connect to your own data, and run on your own infrastructure. The output is not a list of results. It is a written research report, with citations, produced by a system of coordinated AI agents working through the question the same way a team of analysts would.
We’ll look at what AI-Q actually is, how it is built, and what it would take to put it to work.
What Is NVIDIA AI-Q and Why Should You Care
Most search tools inside organizations do one thing: find documents. They return a list of links and leave the work of reading, connecting, and summarizing to the person asking. That gap between "finding" and "knowing" is where a lot of time gets lost.
AI-Q is built to fill that gap. It is an open-source framework that acts more like a research analyst than a search bar. You ask a question. It reads sources, checks its work, gathers more evidence, and hands you a written report with citations. It does this automatically, without a human in the loop.
This is not a chatbot layered on top of a document index. It is a system of coordinated AI agents that plan, search, verify, and write, much the way a team of analysts would divide up a research task.
Two Speeds of Research, One System
AI-Q operates in two modes, and it decides on its own which one a question needs.

Quick answers handle straightforward questions. The system runs a bounded set of searches, pulls citations, and responds in seconds. Think of it as a well-read assistant who can find a fact fast.
Deep research is for harder questions. The system builds a full research plan, breaks it into subtasks, assigns each to a specialist agent, and assembles a long-form report with sourced claims. This is the mode you would use for competitive analysis, regulatory summaries, technical evaluations, or any question where a one-paragraph answer is not enough.
From the project's own documentation, here is what the deep research plan looks like structurally. When asked to compare two approaches to information retrieval, the planner produces something like this before any actual research begins:
{
  "report_title": "RAG vs Long-Context Models for Enterprise Search",
  "report_toc": [
    {
      "id": "1",
      "title": "Architectural Foundations",
      "subsections": [
        {"id": "1.1", "title": "Retrieval-Augmented Generation Pipeline"},
        {"id": "1.2", "title": "Long-Context Transformer Architectures"}
      ]
    },
    {
      "id": "2",
      "title": "Performance and Accuracy Trade-offs",
      "subsections": [
        {"id": "2.1", "title": "Factual Accuracy and Hallucination Rates"},
        {"id": "2.2", "title": "Latency and Throughput Benchmarks"}
      ]
    }
  ],
  "queries": [
    {
      "id": "q1",
      "query": "RAG retrieval-augmented generation architecture components ...",
      "target_sections": ["Architectural Foundations"],
      "rationale": "Establishes baseline understanding of RAG pipelines"
    }
  ]
}
This planning step is what separates AI-Q from a typical AI chat tool. The system maps out what it needs to learn before it starts searching. Each research agent then works from this plan, not from a raw conversation thread, which keeps the agents focused and prevents them from losing track of what they were supposed to find out.
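Because the plan is plain structured data, it can be inspected before any search runs. The sketch below is an illustration only, not code from the AI-Q repository: it uses a trimmed copy of the plan with a shortened query string, and the helper names `queries_by_section` and `uncovered_sections` are hypothetical. It checks which sections of the outline have no query feeding them.

```python
import json

# Trimmed, illustrative copy of the plan structure; the query text is shortened.
plan_json = """
{
  "report_toc": [
    {"id": "1", "title": "Architectural Foundations"},
    {"id": "2", "title": "Performance and Accuracy Trade-offs"}
  ],
  "queries": [
    {"id": "q1", "query": "RAG architecture components",
     "target_sections": ["Architectural Foundations"]}
  ]
}
"""


def queries_by_section(plan: dict) -> dict[str, list[str]]:
    """Map each top-level section title to the ids of the queries that target it."""
    coverage = {section["title"]: [] for section in plan["report_toc"]}
    for query in plan["queries"]:
        for title in query["target_sections"]:
            # setdefault also records queries aimed at titles missing from the outline
            coverage.setdefault(title, []).append(query["id"])
    return coverage


def uncovered_sections(plan: dict) -> list[str]:
    """Return section titles that no planned query will feed."""
    return [title for title, ids in queries_by_section(plan).items() if not ids]


plan = json.loads(plan_json)
coverage = queries_by_section(plan)
print(coverage)
print(uncovered_sections(plan))
```

A check like this is one way a team could catch a thin plan early, before spending search calls on it.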
How Different AI Models Are Used Together
AI-Q does not rely on a single AI model. Different models handle different parts of the work, based on what each model does well.
The configuration file, written in plain YAML, declares which model does what:
llms:
  nemotron_llm_non_thinking:
    _type: nim
    model_name: nvidia/nemotron-3-super-120b-a12b
    temperature: 0.7
    max_tokens: 8192
    chat_template_kwargs:
      enable_thinking: false
  nemotron_llm:
    _type: nim
    model_name: nvidia/nemotron-3-super-120b-a12b
    temperature: 1.0
    max_tokens: 100000
    chat_template_kwargs:
      enable_thinking: true
  gpt-5-2:
    _type: openai
    model_name: 'gpt-5.2'
nemotron_llm_non_thinking handles fast, direct responses where extra reasoning would only slow things down. nemotron_llm turns on chain-of-thought reasoning and raises max_tokens to 100,000, a generation limit (not a context window size) that leaves the agents doing multi-step work room for long reasoning and long reports. The gpt-5-2 entry points at OpenAI's gpt-5.2 model, which can act as the orchestrator that manages the overall research flow.
This is significant for a practical reason: you can call the NVIDIA-hosted models without needing your own GPU cluster. If you want to keep inference completely on your own systems, you can point the same configuration at a self-hosted model instead. The change lives in the llms block, not in the agent code.
Connecting to Your Own Data
The most important capability for most organizations is not web search. It is the ability to point this system at internal data and have it research from there.
AI-Q is built so that adding a new data source does not require changing the core agent code. You write a small connector once, and the agents discover it automatically and use it when relevant. Here is what that looks like in practice, connecting to an internal knowledge base:
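# Import paths assume the NeMo Agent Toolkit `nat` package; check them against your installed version.
from pydantic import Field, SecretStr
from nat.builder.builder import Builder
from nat.builder.function_info import FunctionInfo
from nat.cli.register_workflow import register_function
from nat.data_models.function import FunctionBaseConfig

class InternalKBConfig(FunctionBaseConfig, name="internal_kb"):
    """Search tool for the internal knowledge base."""
    api_url: str = Field(description="Knowledge base API endpoint")
    api_key: SecretStr = Field(description="Authentication key")
    max_results: int = Field(default=5)

@register_function(config_type=InternalKBConfig)
async def internal_kb(config: InternalKBConfig, builder: Builder):
    async def search(query: str) -> str:
        """Search the internal knowledge base for relevant documents."""
        # call_kb_api and format_results are placeholders for your own client code.
        results = await call_kb_api(config.api_url, query, config.max_results)
        return format_results(results)

    yield FunctionInfo.from_fn(search, description=search.__doc__)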
Then, in the configuration file, you reference it alongside the other tools:
functions:
  internal_kb_tool:
    _type: internal_kb
    api_url: "https://kb.internal.company.com/api/v1"
    api_key: ${INTERNAL_KB_API_KEY}
    max_results: 10
  deep_research_agent:
    _type: deep_research_agent
    orchestrator_llm: gpt-5-2
    planner_llm: nemotron_llm
    researcher_llm: nemotron_llm
    tools:
      - advanced_web_search_tool
      - internal_kb_tool
The agents pick up the new tool and start using it. The description in the function's docstring tells the AI when to call it. Nothing else needs to change.
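The connector leaves `format_results` undefined, so here is one hedged way to fill it in. This is a minimal sketch, not the repository's implementation: the `KBHit` dataclass and its field names are assumptions about what a knowledge base API might return. The point it illustrates is that the tool returns plain text with the source attached to every hit, so the agents have something to cite.

```python
from dataclasses import dataclass


@dataclass
class KBHit:
    """One search hit. Field names are assumptions, not a real KB schema."""
    title: str
    url: str
    snippet: str


def format_results(hits: list[KBHit], max_chars: int = 300) -> str:
    """Render hits as numbered, source-tagged text blocks."""
    if not hits:
        return "No results found in the internal knowledge base."
    blocks = []
    for index, hit in enumerate(hits, start=1):
        # Truncate long snippets so one hit cannot crowd out the others
        snippet = hit.snippet.strip()[:max_chars]
        blocks.append(f"[{index}] {hit.title}\nSource: {hit.url}\n{snippet}")
    return "\n\n".join(blocks)


hits = [
    KBHit("VPN setup", "https://kb.example.internal/vpn", "Install the client, then sign in."),
]
formatted = format_results(hits)
print(formatted)
```

Returning an explicit "no results" message instead of an empty string gives the agent a clear signal to try another tool or rephrase the query.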
Running and Deploying

Getting the system running takes a few commands. After cloning the repository and setting up API keys in an environment file, starting the full stack is a single command:
docker compose -f deploy/compose/docker-compose.yaml up --build
This starts three services together: the research agent backend on port 8000, a PostgreSQL database that tracks research jobs and conversation state, and a web interface on port 3000. For larger deployments, a Helm chart is available for Kubernetes.
The system can also be run from the command line for single queries:
dotenv -f deploy/.env run nat run --config_file configs/config_cli_default.yml \
  --input "Summarize the regulatory landscape for data residency in the EU in 2025"
Or run as an API, accepting asynchronous research jobs and returning results when ready. That is useful when the research task takes several minutes, and you do not want to hold a connection open while it runs.
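On the client side, the asynchronous pattern usually comes down to submitting a job and polling for its status. The sketch below shows only the polling half and makes several assumptions: the state names ("running", "completed", "failed") and the shape of the status payload are invented for illustration, and the HTTP call is replaced by a stand-in function, because the source does not list the API's endpoints.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_job(
    fetch_status: Callable[[str], Awaitable[dict]],
    job_id: str,
    poll_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
) -> dict:
    """Poll until the job reaches a terminal state, then return its status payload."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_seconds
    while True:
        status = await fetch_status(job_id)
        if status["state"] in ("completed", "failed"):
            return status
        if loop.time() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_seconds}s")
        await asyncio.sleep(poll_seconds)


# Stand-in for a real HTTP request: reports "running" twice, then "completed".
_states = iter(["running", "running", "completed"])


async def fake_fetch_status(job_id: str) -> dict:
    return {"job_id": job_id, "state": next(_states)}


result = asyncio.run(wait_for_job(fake_fetch_status, "job-123", poll_seconds=0.01))
print(result)
```

Passing the fetch function in as a parameter keeps the polling logic independent of whichever HTTP client and endpoint paths the deployment actually uses.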
Measuring Whether It Actually Works
Most AI tools skip this part. AI-Q ships with built-in evaluation harnesses that let you test output quality against known benchmarks and measure it over time.
The system has been evaluated on two public research benchmarks, DeepResearch Bench and DeepResearch Bench II, and the specific code branches used to achieve those results are preserved so results can be reproduced.
Running your own evaluation follows a three-step process: generate reports on a benchmark dataset, convert the output to a standard format, and score it. The commands are included in the repository. This matters because teams that deploy AI for knowledge work need a way to detect when output quality drifts, especially after model updates or configuration changes.
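Once scores exist for two runs, detecting drift can be as simple as comparing averages. The following is a minimal sketch under stated assumptions: the scores are made-up numbers on a 0 to 1 scale, the 0.05 tolerance is arbitrary, and `detect_drift` is a hypothetical helper rather than part of the AI-Q evaluation harness.

```python
from statistics import mean


def detect_drift(baseline: list[float], current: list[float], tolerance: float = 0.05) -> bool:
    """Return True when the mean score fell by more than `tolerance`."""
    if not baseline or not current:
        raise ValueError("both score lists must be non-empty")
    return mean(baseline) - mean(current) > tolerance


# Made-up scores for illustration only
baseline_scores = [0.80, 0.76, 0.84]
after_model_update = [0.70, 0.68, 0.72]
after_prompt_tweak = [0.79, 0.78, 0.80]

print(detect_drift(baseline_scores, after_model_update))
print(detect_drift(baseline_scores, after_prompt_tweak))
```

A real setup would likely track per-question scores and use a statistical test instead of a fixed threshold, but the principle is the same: keep a baseline and compare every change against it.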
Tracing is also built in. Every research query can generate a full execution trace showing exactly which tools were called, in what order, and what came back. This is useful both for debugging unexpected answers and for understanding where time is being spent.
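To see where time goes, a trace can be reduced to total seconds per tool. The event format below is an assumption made for illustration (a list of dictionaries with `tool` and `seconds` keys); the actual trace schema depends on the tracing backend you configure.

```python
from collections import defaultdict

# Hypothetical trace events; real traces will carry more fields.
trace = [
    {"tool": "advanced_web_search_tool", "seconds": 2.0},
    {"tool": "internal_kb_tool", "seconds": 0.5},
    {"tool": "advanced_web_search_tool", "seconds": 3.0},
]


def time_per_tool(events: list[dict]) -> list[tuple[str, float]]:
    """Sum durations per tool and sort from slowest to fastest."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["tool"]] += event["seconds"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


summary = time_per_tool(trace)
print(summary)
```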
How the Architecture Actually Works
At its core, AI-Q uses a state machine built on LangGraph. A state machine is a system where each step has defined inputs and outputs, and the system moves from one step to the next based on rules, not guesswork. This keeps the overall research process predictable and auditable.
Every query enters through an orchestration node. This node reads the question, classifies what kind of answer is needed (a quick response or a full research job), and routes the work accordingly. It does this in a single step before any research begins, which keeps the system from wasting time on the wrong approach.
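The classify-then-route step can be pictured with a few lines of plain Python. This is a toy sketch, not the LangGraph implementation: the source does not describe how the orchestration node classifies a question, so the keyword and length heuristic in `classify` is purely a stand-in, and the handler strings are placeholders for the real quick-answer and deep-research paths.

```python
from enum import Enum
from typing import Callable


class Route(Enum):
    QUICK = "quick_answer"
    DEEP = "deep_research"


def classify(question: str) -> Route:
    """Toy stand-in for the classifier: long or analytical questions go deep."""
    deep_markers = ("compare", "analysis", "evaluate", "landscape")
    text = question.lower()
    if len(text.split()) > 20 or any(marker in text for marker in deep_markers):
        return Route.DEEP
    return Route.QUICK


# One handler per route; each route is decided once, before any research starts.
HANDLERS: dict[Route, Callable[[str], str]] = {
    Route.QUICK: lambda q: f"[quick] bounded search for: {q}",
    Route.DEEP: lambda q: f"[deep] plan, research, and report on: {q}",
}


def orchestrate(question: str) -> str:
    """Classify the question in a single step, then dispatch to the matching handler."""
    return HANDLERS[classify(question)](question)


print(orchestrate("What port does the backend use?"))
print(orchestrate("Compare RAG and long-context models for enterprise search"))
```

The design point is that the route is an explicit value chosen up front, which is what makes the flow auditable: a trace can record which branch was taken and why.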
For deep research tasks, the orchestration node hands off to two sub-agents that work in sequence. The planner agent produces a structured research outline: a table of contents, a list of queries, and a rationale for each one. The researcher agent then receives only this plan, not the broader conversation or the orchestrator's reasoning. This is intentional. By passing a clean, structured document between agents instead of a long, messy conversation thread, the system avoids a problem common in AI systems: important instructions getting buried and forgotten in a very long context.
Here is what the wiring between sub-agents looks like in code, taken directly from the repository:
from deepagents import create_deep_agent

return create_deep_agent(
    model=self.llm_provider.get(LLMRole.ORCHESTRATOR),
    system_prompt=orchestrator_prompt,
    tools=self.tools,
    subagents=self.subagents,
    middleware=custom_middleware,
    skills=self.skills,
).with_config({"recursion_limit": 1000})
The sub-agents themselves are declared like this:
self.subagents = [
    {
        "name": "planner-agent",
        "system_prompt": render_prompt_template(
            self._prompts["planner"], tools=self.tools_info,
        ),
        "tools": self.tools,
        "model": self.llm_provider.get(LLMRole.PLANNER),
    },
    {
        "name": "researcher-agent",
        "system_prompt": render_prompt_template(
            self._prompts["researcher"], tools=self.tools_info,
        ),
        "tools": self.tools,
        "model": self.llm_provider.get(LLMRole.RESEARCHER),
    },
]
Each agent gets its own model and its own system prompt, plus a tool list (in this snippet both agents share the same one). Swapping the model for a given role is a one-line change in the configuration file. The prompts are stored as editable templates, so adjusting how the planner structures its outlines or how the researcher frames its queries does not require touching the underlying code.
All of this runs as a configurable workflow defined in YAML. Agents, models, tools, and routing behavior are all declared in one place. If you want to change which model handles deep research, add a new data source, or limit how many search calls an agent can make, you change the config file. The core agent code does not need to be modified.
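To make the idea of a search-call limit concrete, here is a small sketch of a wrapper that caps how many times a tool can be invoked. It is an illustration of the concept only: how AI-Q enforces such limits internally is not shown in the source, and `with_call_limit` and `ToolBudgetExceeded` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


class ToolBudgetExceeded(RuntimeError):
    """Raised when a tool is called more often than its budget allows."""


def with_call_limit(
    tool: Callable[[str], Awaitable[str]], max_calls: int
) -> Callable[[str], Awaitable[str]]:
    """Wrap an async tool so it refuses to run after `max_calls` invocations."""
    calls = 0

    async def limited(query: str) -> str:
        nonlocal calls
        if calls >= max_calls:
            raise ToolBudgetExceeded(f"tool budget of {max_calls} calls exhausted")
        calls += 1
        return await tool(query)

    return limited


async def fake_search(query: str) -> str:
    # Stand-in for a real search tool
    return f"results for {query}"


async def demo() -> list[str]:
    limited_search = with_call_limit(fake_search, max_calls=2)
    outputs = [await limited_search("a"), await limited_search("b")]
    try:
        await limited_search("c")
    except ToolBudgetExceeded:
        outputs.append("budget exhausted")
    return outputs


outputs = asyncio.run(demo())
print(outputs)
```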
What Comes Next
The public roadmap for AI-Q includes several additions that are worth noting. On the safety side, NeMo Guardrails integration is planned, which would add configurable controls over what the agents can access and say. Dynamic model routing, where agents automatically choose the best model for each subtask, is also on the list. So are voice input, resource management with configurable limits on token usage, and collaborative report rewriting with a human in the loop.
These are not announced releases. They are stated intentions from the project's own documentation. But they suggest the direction: more control, more customization, and tighter integration with human workflows over time.
The Practical Question
The AI-Q Blueprint is not a finished product you deploy and forget. It is a framework that requires configuration, a team to connect it to your data sources, and an ongoing process to evaluate whether it is producing useful output. The security documentation makes this explicit: authentication, authorization, logging, and access controls are the responsibility of the teams deploying it.
That said, what NVIDIA has released is a complete, working foundation for the kind of AI-driven research capability that most organizations are still trying to figure out how to build. The benchmarks are real, the code is open, and the design is honest about what it can and cannot do on its own.
For organizations where the cost of slow or incomplete research is high, that combination is worth a serious look.
Resources
NVIDIA AI-Q Blueprint Repository—the full source code, configuration examples, and setup instructions
NVIDIA Developer Blog Tutorial—a step-by-step walkthrough with code examples for connecting enterprise data sources
NVIDIA NeMo Agent Toolkit—the orchestration layer on which the blueprint is built


