DONNA: A digital assistant for IT operations: a RAG-based agentic workflow

Managing large, heterogeneous IT systems is complex, and IT operators are expected to resolve incidents quickly. They often have to navigate fragmented documentation and rely heavily on prior experience. In cases of critical system outages, the business impact can be substantial. In the context of e-government, service disruptions can directly affect citizens, and the affected services often carry very demanding SLA requirements.

DONNA and RAG

To address these challenges, DONNA was developed within the ICT & IT National Laboratory project as a Retrieval-Augmented Generation (RAG) based digital assistant designed to streamline IT operations support. DONNA integrates conversational interaction, historical knowledge retrieval, and guided reasoning into a unified, agent-driven workflow. This helps IT operators rapidly discover and apply effective solutions to operational incidents.

In large enterprise environments, operators are responsible for monitoring and maintaining hundreds of systems. When problems arise, they typically rely on ticketing systems to record incidents, track progress, and coordinate remediation. However, relevant knowledge is often scattered across:

Historical incident reports
Internal “how-to” documentation
Distributed team knowledge stored in multiple, disconnected systems

While RAG architectures are well-suited for surfacing such information, deploying them effectively in enterprise environments is challenging. Operators need more than just similar past tickets: they require contextually enriched answers, actionable recommendations, and sometimes step-by-step procedural guidance that integrates insights from multiple sources.
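As a rough illustration of retrieval that draws on more than one source, the sketch below ranks documents from two hypothetical sources (tickets and wiki pages) and assembles a prompt that keeps each passage's origin visible. It is not DONNA's implementation: the `Document` type, the sample corpus, and the token-overlap score are assumptions made for the example, and the overlap score merely stands in for whatever embedding-based search a real deployment would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    source: str  # illustrative labels such as "ticket" or "wiki"
    title: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace (a stand-in for real embeddings)."""
    return set(text.lower().split())


def score(query: str, doc: Document) -> float:
    """Jaccard overlap between query tokens and document tokens."""
    q, d = tokenize(query), tokenize(doc.title + " " + doc.text)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def retrieve(query: str, corpus: list[Document], k: int = 2) -> list[Document]:
    """Return up to k best-scoring documents, dropping zero-score ones."""
    ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
    return [doc for doc in ranked[:k] if score(query, doc) > 0]


def build_context(query: str, docs: list[Document]) -> str:
    """Assemble a prompt in which every passage is tagged with its source."""
    passages = "\n".join(f"[{d.source}] {d.title}: {d.text}" for d in docs)
    return f"Context:\n{passages}\n\nQuestion: {query}"


# Hypothetical sample data, invented for this sketch.
CORPUS = [
    Document("ticket", "Disk full on db01", "database disk full after log growth"),
    Document("wiki", "How to rotate logs", "rotate logs to free disk space"),
    Document("ticket", "VPN login failure", "users cannot authenticate to vpn"),
]

if __name__ == "__main__":
    question = "disk full on database"
    print(build_context(question, retrieve(question, CORPUS)))
```

Tagging each passage with its source lets the generated answer combine a past incident with the matching procedure while remaining traceable to both.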

Recent advances in open-source large language models (LLMs) have shifted this landscape. Open models can now achieve performance levels previously restricted to massive proprietary architectures. This enables the development of specialized, domain-focused applications like DONNA, which can run cost-effectively on self-hosted infrastructure and transform IT operations.

System Architecture

DONNA is designed as a self-hosted, GPU-enabled, Kubernetes-deployed system built entirely with open-source components. Its agentic workflow engine includes:

Multi-tool agents: Call specialized tools such as ticket search and wiki page search.
Decision control: Determines whether to retrieve information, summarize results, escalate to human guidance, or recommend automated remediation steps.
Domain-specific prompt templates: Ensure the agent interprets IT jargon correctly and produces safe, precise recommendations.
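The interplay of tools and decision control listed above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not DONNA's engine: the `Action` names mirror the four decisions named in the list, while `State`, `decide`, `run`, the two tool functions, and their canned results are hypothetical. The rule-based `decide` stands in for the choice an LLM would make in a real agent.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    RETRIEVE = "retrieve"
    SUMMARIZE = "summarize"
    ESCALATE = "escalate"
    RECOMMEND = "recommend"


@dataclass
class State:
    query: str
    evidence: list[str] = field(default_factory=list)
    retrieved: bool = False
    summary: Optional[str] = None


# Hypothetical tools with canned results; real tools would query a
# ticketing system or a wiki.
def search_tickets(query: str) -> list[str]:
    known = {"disk full": "Ticket A: cleared old logs to free space"}
    return [hit for key, hit in known.items() if key in query.lower()]


def search_wiki(query: str) -> list[str]:
    known = {"disk": "Wiki: log rotation procedure"}
    return [hit for key, hit in known.items() if key in query.lower()]


TOOLS: dict[str, Callable[[str], list[str]]] = {
    "ticket_search": search_tickets,
    "wiki_search": search_wiki,
}


def decide(state: State) -> Action:
    """Rule-based stand-in for the decision an LLM would make."""
    if not state.retrieved:
        return Action.RETRIEVE
    if not state.evidence:
        return Action.ESCALATE
    if state.summary is None:
        return Action.SUMMARIZE
    return Action.RECOMMEND


def run(query: str, tools: dict[str, Callable[[str], list[str]]],
        max_steps: int = 5) -> tuple[Action, str]:
    """Iterate decide/act until a terminal action or the step budget ends."""
    state = State(query)
    for _ in range(max_steps):
        action = decide(state)
        if action is Action.RETRIEVE:
            for tool in tools.values():
                state.evidence.extend(tool(query))
            state.retrieved = True
        elif action is Action.SUMMARIZE:
            state.summary = "; ".join(state.evidence)
        elif action is Action.ESCALATE:
            return action, "No matching knowledge found; handing over to a human operator."
        else:
            return action, f"Suggested next step based on: {state.summary}"
    return Action.ESCALATE, "Step budget exhausted; handing over to a human operator."


if __name__ == "__main__":
    print(run("Disk full on db01", TOOLS))
    print(run("Printer jam on floor 3", TOOLS))
```

The step budget and the explicit escalation path are design choices of this sketch: they ensure that the loop always terminates and that an empty retrieval result leads to a human rather than to an unsupported answer.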

DONNA offers a web-based chatbot interface, similar to ChatGPT, which operators can customize for their workflows.

Development Challenges and Solutions

Implementation presented several technical and operational challenges:

Scaling LLM serving: Maintaining low-latency inference for multiple concurrent users on limited GPU resources.
Relevance of retrieved documents: Ensuring timely and accurate information for urgent operational tasks.
Agent behavior control: Preventing “hallucinated” remediation steps.
UI adoption: Aligning the conversational interface with existing operator workflows.
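One possible way to limit hallucinated remediation steps is to accept a proposed step only if it is both grounded in a retrieved document and covered by an approved command list. The sketch below illustrates that idea only; it does not describe how DONNA addresses the problem. The `Step` type, the `APPROVED_PREFIXES` allowlist, and the document identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    command: str    # command text proposed by the model
    source_id: str  # document the model cites as support


# Hypothetical allowlist; a real one would come from operator-approved runbooks.
APPROVED_PREFIXES = ("systemctl restart", "logrotate", "df")


def is_allowed(step: Step, retrieved_ids: set[str]) -> bool:
    """Accept a step only if it cites a retrieved document and is allowlisted."""
    grounded = step.source_id in retrieved_ids
    approved = any(
        step.command == prefix or step.command.startswith(prefix + " ")
        for prefix in APPROVED_PREFIXES
    )
    return grounded and approved


def filter_steps(steps: list[Step],
                 retrieved_ids: set[str]) -> tuple[list[Step], list[Step]]:
    """Split proposed steps into accepted and rejected lists."""
    accepted = [s for s in steps if is_allowed(s, retrieved_ids)]
    rejected = [s for s in steps if not is_allowed(s, retrieved_ids)]
    return accepted, rejected


if __name__ == "__main__":
    proposed = [
        Step("logrotate --force /etc/logrotate.conf", "wiki-1"),
        Step("rm -rf /var/lib/db", "wiki-1"),           # not on the allowlist
        Step("systemctl restart nginx", "ticket-99"),   # cites an unretrieved doc
    ]
    accepted, rejected = filter_steps(proposed, retrieved_ids={"wiki-1"})
    print("accepted:", [s.command for s in accepted])
    print("rejected:", [s.command for s in rejected])
```

Rejected steps could be shown to the operator as unverified rather than silently dropped, which keeps the final decision with a human.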

Lessons Learned and Future Work

The proof-of-concept validated that an open-source, self-hosted RAG and agent architecture can deliver measurable efficiency gains for IT operations. Key takeaways include:

Self-hosting provides data privacy control but requires careful capacity planning and GPU optimization.
Agent workflows are essential for multi-step reasoning but must be constrained with explicit domain guardrails.
Knowledge base quality has a direct impact on the accuracy and usefulness of responses.
Operator feedback is indispensable: it plays a critical role in fine-tuning the system.

Future work will focus on integrating automated log analysis for proactive anomaly detection, as explored in our paper “Anomaly Detection Algorithms for Real-Time Log Data Analysis at Scale” (https://ieeexplore.ieee.org/document/11105402), and formally evaluating DONNA’s impact on incident resolution times in production environments.