Managing large, heterogeneous IT systems is a complex task in which IT operators are expected to resolve incidents quickly. They often have to navigate fragmented documentation and rely heavily on prior experience. In cases of critical system outages, the business impact can be substantial. In e-government, service disruptions can directly affect citizens, and the services involved are often subject to very demanding SLA requirements.
To address these challenges, DONNA was developed within the ICT & IT National Laboratory project as a Retrieval-Augmented Generation (RAG) based digital assistant designed to streamline IT operations support. DONNA integrates conversational interaction, historical knowledge retrieval, and guided reasoning into a unified, agent-driven workflow. This enables IT operators to rapidly discover and apply effective solutions to operational incidents.
In large enterprise environments, operators are responsible for monitoring and maintaining hundreds of systems. When problems arise, they typically rely on ticketing systems to record incidents, track progress, and coordinate remediation. However, relevant knowledge is often scattered across:
While RAG architectures are well suited to surfacing such information, deploying them effectively in enterprise environments is challenging. Operators need more than a list of similar past tickets: they require contextually enriched answers, actionable recommendations, and sometimes step-by-step procedural guidance that integrates insights from multiple sources.
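As a rough illustration of the retrieval step that such an assistant builds on, the following minimal sketch ranks past tickets against an operator's question and assembles a grounded prompt. It is not DONNA's implementation: the ticket fields, the sample data, the function names `retrieve` and `build_prompt`, and the bag-of-words cosine similarity (standing in for a real embedding model and vector store) are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical past tickets; fields and contents are invented for illustration.
TICKETS = [
    {"id": "T-1",
     "summary": "Database connection pool exhausted on login service",
     "resolution": "Raised the pool size and restarted the login service."},
    {"id": "T-2",
     "summary": "Disk full on log server",
     "resolution": "Rotated old logs and expanded the volume."},
    {"id": "T-3",
     "summary": "Certificate expired on public portal",
     "resolution": "Renewed the certificate and reloaded the web server."},
]


def tokenize(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, tickets, k=2):
    """Return up to k tickets whose summaries overlap with the query."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(t["summary"])), t) for t in tickets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored[:k] if score > 0]


def build_prompt(query, hits):
    """Combine the question with retrieved tickets into one LLM prompt."""
    context = "\n".join(
        f"[{t['id']}] {t['summary']} -> {t['resolution']}" for t in hits
    )
    return (
        "Answer the operator's question using only the tickets below.\n"
        f"Tickets:\n{context}\n"
        f"Question: {query}\n"
        "Give step-by-step guidance and cite ticket ids."
    )


question = "login service cannot get a database connection"
hits = retrieve(question, TICKETS)
prompt = build_prompt(question, hits)
print(prompt)
```

The prompt, rather than the raw list of matches, is what would be passed to the language model. This is where the gap described above shows: retrieval alone returns similar tickets, while the generation step is expected to turn them into an actionable, sourced answer.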
Recent advances in open-source large language models (LLMs) have shifted this landscape. Open-source models can now reach performance levels previously limited to large proprietary models. This enables the development of specialized, domain-focused applications like DONNA, which can run cost-effectively on self-hosted infrastructure.
System Architecture
DONNA is designed as a self-hosted, GPU-enabled, Kubernetes-deployed system built entirely with open-source components. Its agentic workflow engine includes:
DONNA offers a web-based chatbot interface, similar to ChatGPT, which operators can customize for their workflows.
Implementation presented several technical and operational challenges:
The proof-of-concept validated that an open-source, self-hosted RAG and agent architecture can deliver measurable efficiency gains for IT operations. Key takeaways include:
Future work will focus on integrating automated log analysis for proactive anomaly detection, as explored in our paper “Anomaly Detection Algorithms for Real-Time Log Data Analysis at Scale” (https://ieeexplore.ieee.org/document/11105402), and on formally evaluating DONNA’s impact on incident resolution times in production environments.