AI meets data protection: our ChatGPT solution for corporate knowledge

Frontend of our scieneers internal chat application

Everyone knows ChatGPT – a chatbot that provides answers to almost any question. In many companies, however, its use is either not yet officially permitted or no equivalent tool is provided. Yet ChatGPT can be operated in a completely secure and data protection-compliant way and give employees easy access to internal company knowledge. Based on experience from numerous Retrieval Augmented Generation (RAG) projects, we have developed a modular system that is specially tailored to the needs of medium-sized companies and organizations. In this article, we present our lightweight and customizable chatbot, which enables data protection-compliant access to company knowledge.

1. Use your own sources of knowledge

The core of our RAG system is that an LLM can access company-specific knowledge sources. Various data sources can be made available to the chatbot:

1. User-specific documents:

  • Employees can upload their own files, e.g. PDF documents, Word files, Excel spreadsheets or even videos. These are processed in the background and are permanently available for chatting after a short time.
  • The processing status can be viewed at any time so that it remains transparent when the content can be used for requests.
  • Example: A sales employee uploads an Excel spreadsheet with price information. The system can then answer questions about the prices of specific products.

2. Global internal company knowledge sources:

  • The system can access central documents from platforms such as SharePoint, OneDrive or the intranet. This data is accessible to all users.
  • Example: An employee wants to look up the regulations of the company pension scheme. As the relevant company agreement is accessible, the chatbot can provide the correct answer.

3. Group-specific knowledge sources:

  • Information and documents can also be made accessible only to specific teams or departments.
  • Example: Only the HR team has access to the onboarding guidelines, so the system answers questions about them only for HR employees.

Even if users work with all available data sources, the system ensures in the background that only information relevant to a query is used to generate answers. Intelligent filter mechanisms automatically hide irrelevant content.

Users do, however, have the option of explicitly specifying which knowledge sources should be taken into account. For example, they can choose whether a query should access current quarterly figures or general HR guidelines. This prevents irrelevant or outdated information from being included in the response.
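
To make the interplay of group permissions and explicit source selection concrete, here is a minimal sketch in Python. It is illustrative only, not our production code; the Chunk structure, the group names and the in-memory list stand in for a real vector database.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        text: str
        source: str                # e.g. "sharepoint", "user_upload", "hr"
        allowed_groups: frozenset  # groups that may see this chunk
        embedding: np.ndarray = None

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query_embedding, chunks, user_groups, selected_sources=None, top_k=5):
        """Return the top-k chunks the user may see, optionally restricted
        to explicitly selected knowledge sources."""
        candidates = [
            c for c in chunks
            if c.allowed_groups & user_groups  # group-level access control
            and (selected_sources is None or c.source in selected_sources)
        ]
        candidates.sort(key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
        return candidates[:top_k]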

2. Feedback process and continuous improvement

A central component of our RAG solution is the ability to systematically collect user feedback in order to identify weaknesses and improve the quality of the system. One typical weak point, for example, is documents with inconsistent formats, such as poorly scanned PDFs or tables with multiple nested levels that are interpreted incorrectly.

Users can easily provide feedback on any message by using the “thumbs up/down” icons and adding an optional comment. This feedback can either be evaluated manually by admins or processed by automated analyses in order to identify optimization potential in the system.
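
As an illustration, a feedback record can be as simple as the following sketch; the field names are assumptions, not our actual schema, and a real system would persist to a database.

    from dataclasses import dataclass
    from datetime import datetime, timezone
    from typing import Optional

    @dataclass
    class Feedback:
        message_id: str
        user_id: str
        rating: int                   # +1 = thumbs up, -1 = thumbs down
        comment: Optional[str] = None
        created_at: Optional[datetime] = None

    feedback_store: list = []         # stand-in for a database table

    def record_feedback(message_id, user_id, thumbs_up, comment=None):
        fb = Feedback(message_id, user_id, 1 if thumbs_up else -1,
                      comment, datetime.now(timezone.utc))
        feedback_store.append(fb)
        return fb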

3. Budget management – control over usage and costs

The data protection-compliant use of LLMs in connection with proprietary knowledge sources offers companies enormous opportunities, but admittedly also entails costs. Well thought-out budget management helps to use resources fairly and efficiently and to keep an eye on costs.

How does budget management work?

  • Individual and group-based budgets: Administrators define how much budget is available to individual employees or teams in a given period. This budget does not necessarily have to be a euro amount; it can also be expressed in a virtual currency of your own.
  • Transparency for users: All employees can view their current budget status at any time. The system shows how much of the budget has already been used and how much is still available. If the set limit is reached, the chat pauses automatically until the budget is reset or adjusted. A simplified sketch of such a check follows below.
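
The following is a much simplified sketch of a budget check before each model call. The cost model (counted in tokens) and the flat dictionaries are assumptions; in practice the completion cost is only known after the model has responded.

    budgets = {"alice": 100_000}   # token budget per user or team and period
    usage = {"alice": 87_500}      # tokens already consumed in this period

    def charge(user: str, prompt_tokens: int, completion_tokens: int) -> bool:
        """Book a request against the budget; False means the chat pauses."""
        cost = prompt_tokens + completion_tokens
        remaining = budgets.get(user, 0) - usage.get(user, 0)
        if cost > remaining:
            return False           # paused until the budget is reset or adjusted
        usage[user] = usage.get(user, 0) + cost
        return True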

4. Secure authentication – protection for sensitive data

For companies, secure and flexible authentication is often an essential aspect. As RAG systems frequently work with sensitive and confidential information, a well thought-out authentication concept is indispensable.

  • Authentication systems: Our solution enables the connection of different authentication methods, including widely used systems such as Microsoft Entra ID (formerly Azure Active Directory). This offers the advantage of seamlessly integrating existing company structures for user management.
  • Access control: Different authorizations can be defined based on user roles, e.g. for access to specific knowledge sources or functions, as in the sketch below.
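
As a sketch, such access control can be as simple as mapping the group claims of a validated identity token (for example the "groups" claim of an Entra ID token) to knowledge sources. Token validation itself is omitted here, and all group and source names are made up.

    GROUP_TO_SOURCES = {
        "hr-team":   {"hr_guidelines", "onboarding"},
        "all-staff": {"intranet", "company_agreements"},
    }

    def allowed_sources(token_claims: dict) -> set:
        """Collect every knowledge source the user's groups may access."""
        sources = set()
        for group in token_claims.get("groups", []):
            sources |= GROUP_TO_SOURCES.get(group, set())
        return sources

    # allowed_sources({"groups": ["all-staff"]})
    # -> {"intranet", "company_agreements"}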

5. Flexible user interface

Our current solution combines the most requested front-end features from various projects and thus offers a user interface that can be individually customized. Functions can be hidden or extended as required to meet specific requirements.

Chat application including PDF viewer for displaying quoted documents

  • Chat history: All chats are automatically named and saved. If desired, users can delete chats completely – this also includes permanent removal from the system.
  • Citations: Citations ensure that information remains traceable and verifiable. For complex or business-critical questions in particular, this strengthens credibility and enables users to directly check the accuracy and context of the answers. Each answer from the system contains references to the original document sources, for example with links to the exact page in a PDF or a jump to the original document source.
  • Easy customization of prompts: To control the system responses, prompts can be customized via a user-friendly interface – without any prior technical knowledge.
  • Output of different media types: Different output formats, such as code blocks or formulas, are displayed accordingly in the responses.

Conclusion: Fast start, flexible customization, transparent control

Our chatbot solution is based on the experience gained from numerous projects and enables companies to use language models in a targeted and data protection-compliant manner. Specific internal knowledge sources such as SharePoint, OneDrive or individual documents can be integrated efficiently.

Thanks to a flexible code base, the system can be quickly adapted to different use cases. Functions such as feedback integration, budget management and secure authentication ensure that companies retain control at all times – over sensitive data and costs. The system therefore not only offers a practical solution for dealing with company knowledge, but also the necessary transparency and security for sustainable use.

Are you curious? Then we would be happy to show you our system in a live demo in a personal meeting and answer your questions. Just write to us!

Author

Alina Dallmann, Data Scientist at scieneers GmbH
alina.dallmann@scieneers.de

How students can benefit from LLMs and chatbots

In modern higher education, the optimisation and personalisation of the learning process is extremely important. Technologies such as Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) can play a supporting role, especially in complex courses such as law. A pilot project at the University of Leipzig, involving the university’s Computing Centre and the Faculty of Law, shows how these technologies can be successfully used in the form of an AI chatbot.

Background and Turing

In 1950, Alan Turing posed a revolutionary question in his essay “Computing Machinery and Intelligence”: Can machines think? He proposed the famous “imitation game”, now known as the Turing Test. In his view, a machine could be said to “think” if it could fool a human interrogator into believing it was human.

This idea forms the theoretical basis for many modern AI applications. We have come a long way since then, and new opportunities are opening up for students in particular to use AI tools such as LLMs to support their studies.

How does such a chatbot work for law studies?

The AI-based chatbot uses OpenAI’s advanced language models, which are based on the Transformer architecture. These systems, such as GPT-4, can be augmented with the Retrieval Augmented Generation (RAG) method to provide correct answers to more complex legal questions. The process consists of several steps:

1. Ask a question (Query): Students ask a legal question, for example, “What is the difference between a mortgage and a security mortgage?”

2. Processing the query (Embedding): The question is converted into a numerical vector (embedding) so that it can be compared with the stored course materials.

3. Search in vector database: The retrieval system searches a vector database for relevant texts that match the question. These can be lecture notes, case solutions or lecture slides.

4. Answer generation: The LLM analyses the data found and provides a precise answer. The answer can be provided with references, e.g. the page in the script or the corresponding slide in the lecture. A minimal sketch of these four steps follows below.
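
The following sketch illustrates steps 2 to 4 with the OpenAI Python SDK. The vector index, its search method and the prompt wording are illustrative assumptions, not the pilot project's actual implementation.

    from openai import OpenAI

    client = OpenAI()

    def answer(question: str, index) -> str:
        # Step 2: convert the question into an embedding vector
        emb = client.embeddings.create(
            model="text-embedding-ada-002", input=question
        ).data[0].embedding
        # Step 3: "index.search" stands in for a real vector database query
        hits = index.search(emb, top_k=3)  # passages with source references
        context = "\n\n".join(f"[{h.source}] {h.text}" for h in hits)
        # Step 4: generate an answer grounded in the retrieved material
        chat = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system",
                 "content": "Answer only from the provided excerpts and cite "
                            "the script page or lecture slide for each claim."},
                {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
            ],
        )
        return chat.choices[0].message.content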

This is a powerful tool for law students, as they not only get quick answers to very individual questions, but also have direct links to the relevant teaching materials. This makes it easier to understand complex legal concepts and encourages independent learning.

Benefits for students and professors

Chatbots offer several benefits for teaching and learning in universities. For students, this means

  • Personalised learning support: Students can ask individual questions and receive tailor-made answers.
  • Adaptation to different subjects: The chatbot can easily be adapted to different areas of law, such as civil, criminal or public law. It can also explain more difficult legal concepts or help with exam preparation.
  • Flexibility and cost transparency: Whether at home or on the move, the chatbot is always available and provides access to key information – via a Learning Management System (LMS) such as Moodle or directly as an app. In addition, monthly token budgets ensure clear cost control.

The use of LLMs in combination with RAG also has advantages for teachers:

  • Planning support: AI tools can help to better structure courses.
  • Development of teaching materials: AI can support the creation of assignments, teaching materials, case studies or exam questions.

Challenges in using LLMs

Despite the many benefits and opportunities offered by chatbots and other AI-based learning systems, there are also challenges that need to be considered:

  • Resource-intensive: The operation of such systems requires a high level of computing power and costs.
  • Provider dependency: Currently, many such systems rely on interfaces to external providers such as Microsoft Azure or OpenAI, which can limit independence from universities.
  • Quality of answers: AI systems do not always produce correct results; “hallucinations” (incorrect or nonsensical answers) can occur. Like all data-based systems, LLMs can be biased by the training data used. Therefore, both the accuracy of the answers and the avoidance of bias must be ensured.

The technical background: Azure and OpenAI

The chatbot described above is built on the Microsoft Azure cloud infrastructure. Azure provides several services that enable secure and efficient operation. These include:

  • AI Search: A hybrid search that combines both vector and full-text search to quickly find relevant data.
  • Document Intelligence: Extracts information from PDF documents and provides direct access to lecture slides, scripts, or other educational materials.
  • OpenAI: Azure provides access to OpenAI’s powerful language models. For example, the implementation uses GPT-4 Turbo and the text-embedding-ada-002 model for text embeddings to efficiently generate correct answers. A sketch of what such a hybrid query can look like follows below.
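
The following is a hedged sketch of a hybrid query with the azure-search-documents SDK; the endpoint, index name, field names and credentials are placeholders and assumptions, not the project's actual configuration.

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient
    from azure.search.documents.models import VectorizedQuery

    search = SearchClient(endpoint="https://<service>.search.windows.net",
                          index_name="lecture-materials",
                          credential=AzureKeyCredential("<api-key>"))

    def hybrid_search(query_text: str, query_vector: list, top: int = 3):
        # One request combines full-text and vector search
        results = search.search(
            search_text=query_text,
            vector_queries=[VectorizedQuery(vector=query_vector,
                                            k_nearest_neighbors=top,
                                            fields="contentVector")],
            top=top,
        )
        return [(r["title"], r["content"]) for r in results]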

Presentation of the data processing procedure

Conclusion

The pilot project at the University of Leipzig shows how the use of LLMs and RAG can support higher education. These technologies not only make learning processes more efficient, but also more flexible and targeted.

The use of Microsoft Azure also ensures secure and GDPR-compliant data processing.

The combination of powerful language models and innovative search methods offers both students and teachers new and effective ways to improve learning and teaching. The future of learning will be personalized, scalable, and always available.

Authors

Florence López

Florence Lopez, Data Scientist and Diversity Manager at scieneers GmbH
florence.lopez@scieneers.de


Hikaru Han, Working Student in Online-Marketing at scieneers GmbH
shinchit.han@scieneers.de

Leveraging VideoRAG for Company Knowledge Transfer

The Knowledge Transfer Challenge

In many companies, the issue isn’t a lack of data but how to manage and make it accessible to employees. One particularly pressing challenge is the transfer of knowledge from senior employees to younger generations. This is no small task, as it’s not just about transferring what’s documented in manuals or process guides, but the implicit knowledge that exists “between the lines”—the insights and experience locked within the minds of long-serving employees.

This challenge has been present across industries for many years, and as technology evolves, so do the solutions. With the rapid advancement of Artificial Intelligence (AI), particularly Generative AI, new possibilities for preserving and sharing this valuable company knowledge are emerging.

The Rise of Generative AI

Generative AI, especially Large Language Models (LLMs) such as OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, or Meta’s Llama 3.2, offers new ways to process large amounts of unstructured data and make them accessible. These models enable users to interact with company data via chatbot applications, making knowledge transfer more dynamic and user-friendly.

But the question remains — how do we make the right data accessible to the chatbot in the first place? This is where Retrieval-Augmented Generation (RAG) comes into play.

Retrieval-Augmented Generation (RAG) for Textual Data

RAG has proven to be a reliable solution for handling textual data. The concept is straightforward: all available company data is chunked and stored in (vector) databases, where it is transformed into numerical embeddings. When a user makes a query, the system searches for relevant data chunks by comparing the query’s embedding with the stored data.

With this method, there’s no need to fine-tune LLMs. Instead, relevant data is retrieved and appended to the user’s query in the prompt, ensuring that the chatbot’s responses are based on the company’s specific data. This approach works effectively for all types of textual data, including PDFs, webpages, and even image-embedded documents using multi-modal embeddings.
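
A minimal sketch of this ingestion side: documents are split into overlapping chunks, embedded, and stored together with their text. The chunk size, the overlap and the in-memory list are simplified assumptions; real systems write to a vector database.

    from openai import OpenAI

    client = OpenAI()
    vector_store = []  # stand-in for a vector database

    def ingest(doc_id: str, text: str, chunk_size: int = 800, overlap: int = 100):
        # Overlapping character windows keep sentences intact across chunk borders
        step = chunk_size - overlap
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
        embeddings = client.embeddings.create(
            model="text-embedding-ada-002", input=chunks
        )
        for chunk, item in zip(chunks, embeddings.data):
            vector_store.append({"doc": doc_id, "text": chunk,
                                 "embedding": item.embedding})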

In this way, company knowledge stored in documents becomes easily accessible to employees, customers, or other stakeholders via AI-powered chatbots.

Extending RAG to Video Data

While RAG works well for text-based knowledge, it doesn’t fully address the challenge of more complex, process-based tasks that are often better demonstrated visually. For tasks like machine maintenance, where it’s difficult to capture everything through written instructions alone, video tutorials provide a practical solution without the need for time-consuming documentation-writing.

Videos offer a rich source of implicit knowledge, capturing processes step-by-step with commentary. However, unlike text, automatically describing a video is far from a straightforward task. Even humans approach this differently, often focusing on varying aspects of the same video based on their perspective, expertise, or goals. This variability highlights the challenge of extracting complete and consistent information from video data.

Breaking Down Video Data

To make knowledge captured in videos accessible to users via a chatbot, our goal must be to provide a structured process to convert videos into textual form that prioritizes extracting as much relevant information as possible. Videos consist of three primary components:

  • Metadata: The handling of metadata is typically straightforward, as it is often available in structured textual form.
  • Audio: Audio can be transcribed into text using speech-to-text (STT) models like OpenAI’s Whisper. For industry-specific contexts, it’s also possible to enhance accuracy by incorporating custom terminology into these models.
  • Frames (visuals): The real challenge lies in integrating the frames with the audio transcription in a meaningful way. Both components are interdependent — frames often lack context without audio explanations, and vice versa. (A sketch of how frames can be sampled from a video follows below.)
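
As a sketch of the frames component, video frames can be sampled at a fixed rate with OpenCV; the one-frame-per-second rate here is an arbitrary assumption for illustration.

    import cv2

    def sample_frames(path: str, every_n_seconds: float = 1.0):
        """Return (timestamp_in_seconds, frame) pairs sampled from the video."""
        cap = cv2.VideoCapture(path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        step = max(1, round(fps * every_n_seconds))
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                frames.append((idx / fps, frame))
            idx += 1
        cap.release()
        return frames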

Tackling the Challenges of Video Descriptions

Figure 1: Chunking Process of VideoRAG.

When working with video data, we encounter three primary challenges:

  1. Describing individual images (frames).
  2. Maintaining context, as not every frame is independently relevant.
  3. Integrating the audio transcription for a more complete understanding of the video content.

To address these, multi-modal models like GPT-4o, capable of processing both text and images, can be employed. By using both video frames and transcribed audio as inputs to these models, we can generate a comprehensive description of video segments.

However, maintaining context between individual frames is crucial, and this is where frame grouping (often also referred to as chunking) becomes important. There are two primary methods for grouping frames:

  • Fixed Time Intervals: A straightforward approach where consecutive frames are grouped based on predefined time spans. This method is easy to implement and works well for many use cases.
  • Semantic Chunking: A more sophisticated approach where frames are grouped based on their visual or contextual similarity, effectively organizing them into scenes. There are various ways to implement semantic chunking, such as using Convolutional Neural Networks (CNNs) to calculate frame similarity or leveraging multi-modal models like GPT-4o for pre-processing. By defining a threshold for similarity, you can group related frames to better capture the essence of each scene.

Once frames are grouped, they can be combined into image grids. This technique allows the model to understand the relation and sequence between different frames, preserving the video’s narrative structure.
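
Here is a sketch of fixed-interval grouping and grid composition with Pillow; the 30-second window, the 3-column layout and the tile size are assumptions. The images are expected as PIL images, so OpenCV frames would first need a BGR-to-RGB conversion via Image.fromarray.

    from PIL import Image

    def group_by_interval(frames, window_seconds=30.0):
        """Group (timestamp, image) pairs into consecutive fixed time windows."""
        groups = {}
        for ts, img in frames:
            groups.setdefault(int(ts // window_seconds), []).append(img)
        return [groups[k] for k in sorted(groups)]

    def make_grid(images, cols=3, tile=(320, 180)):
        """Paste the images of one group into a single grid image."""
        rows = -(-len(images) // cols)  # ceiling division
        grid = Image.new("RGB", (cols * tile[0], rows * tile[1]))
        for i, img in enumerate(images):
            grid.paste(img.resize(tile), ((i % cols) * tile[0], (i // cols) * tile[1]))
        return grid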

The choice between fixed time intervals and semantic chunking depends on the specific requirements of the use case. In our experience, fixed intervals are often sufficient for most scenarios. Although semantic chunking better captures the underlying semantics of the video, it requires tuning several hyperparameters and can be more resource-intensive, as each use case may require a unique configuration.

With the growing capabilities of LLMs and increasing context windows, one might be tempted to pass all frames to the model in a single call. However, this approach should be used cautiously. Passing too much information at once can overwhelm the model, causing it to miss crucial details. Additionally, current LLMs are constrained by their output token limits (e.g., GPT-4o allows 4096 tokens), which further emphasizes the need for thoughtful processing and framing strategies.

Building Video Descriptions with Multi-Modal Models

Figure 2: Ingestion Pipeline of VideoRAG.

Once the frames are grouped and paired with their corresponding audio transcription, the multi-modal model can be prompted to generate descriptions of these chunks of the video. To maintain continuity, descriptions from earlier parts of the video can be passed to later sections, creating a coherent flow as shown in Figure 2. At the end, you’ll have descriptions for each part of the video that can be stored in a knowledge base alongside timestamps for easy reference.
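
A sketch of this loop with a multi-modal model: each chunk's image grid and transcript are sent together with the previous chunk's description to preserve continuity. The prompt wording and the base64 image handling are illustrative assumptions.

    import base64, io
    from openai import OpenAI

    client = OpenAI()

    def describe_chunks(chunks):
        """chunks: list of (grid_image, transcript) pairs in video order."""
        descriptions, previous = [], ""
        for grid, transcript in chunks:
            buf = io.BytesIO()
            grid.save(buf, format="JPEG")
            url = "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()
            resp = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": [
                    {"type": "text", "text":
                        f"Earlier in the video: {previous}\n"
                        f"Audio transcript of this segment: {transcript}\n"
                        "Describe what happens in this segment."},
                    {"type": "image_url", "image_url": {"url": url}},
                ]}],
            )
            previous = resp.choices[0].message.content  # carries context forward
            descriptions.append(previous)
        return descriptions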

Bringing VideoRAG to Life

Figure 3: Retrieval process of VideoRAG.

As shown in Figure 3, all scene descriptions from the videos stored in the knowledge base are converted into numerical embeddings. This allows user queries to be similarly embedded, enabling efficient retrieval of relevant video scenes through vector similarity (e.g., cosine similarity). Once the most relevant scenes are identified, their corresponding descriptions are added to the prompt, providing the LLM with context grounded in the actual video content. In addition to the generated response, the system retrieves the associated timestamps and video segments, enabling users to review and validate the information directly from the source material.
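
A minimal sketch of this retrieval step follows; the in-memory knowledge base holding description, embedding and timestamps is a stand-in for a real vector store.

    import numpy as np

    def top_scenes(query_embedding, knowledge_base, k=3):
        """knowledge_base entries: dicts with "embedding", "description",
        "video" and "start"/"end" timestamps."""
        q = np.asarray(query_embedding)
        q = q / np.linalg.norm(q)
        scored = []
        for scene in knowledge_base:
            v = np.asarray(scene["embedding"])
            scored.append((float(q @ (v / np.linalg.norm(v))), scene))
        scored.sort(key=lambda pair: pair[0], reverse=True)  # cosine similarity
        return [scene for _, scene in scored[:k]]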

By combining RAG techniques with video processing capabilities, companies can build a comprehensive knowledge base that includes both textual and video data. Employees, especially newer ones, can quickly access critical insights from older colleagues — whether documented or demonstrated on video — making knowledge transfer more efficient.

Lessons Learned

During the development of VideoRAG, we encountered several key insights that could benefit future projects in this domain. Here are some of the most important lessons learned:

1. Optimizing Prompts with the CO-STAR Framework

As is the case with most applications involving LLMs, prompt engineering proved to be a critical component of our success. Crafting precise, contextually aware prompts significantly impacts the model’s performance and output quality. We found that using the CO-STAR framework — a structure emphasizing Context, Objective, Style, Tone, Audience, and Response—provided a robust guide for prompt design.

By systematically addressing each element of CO-STAR, we ensured consistency in responses, especially in terms of description format. Prompting with this structure enabled us to deliver more reliable and tailored results, minimizing ambiguities in video descriptions.
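
For illustration, a CO-STAR-structured prompt for the description step could look like the following; the wording is our own example, not a canonical formulation.

    CO_STAR_PROMPT = """\
    # CONTEXT
    You receive an image grid of consecutive video frames and the matching
    audio transcript of a maintenance tutorial.

    # OBJECTIVE
    Describe the actions shown in this video segment, step by step.

    # STYLE
    Factual, concise technical documentation.

    # TONE
    Neutral and precise.

    # AUDIENCE
    Employees who need to carry out the shown procedure themselves.

    # RESPONSE
    A numbered list of steps; name tools and machine parts explicitly.
    """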

2. Implementing Guardrail Checks to Prevent Hallucinations

One of the more challenging aspects of working with LLMs is managing their tendency to generate answers, even when no relevant information exists in the knowledge base. When a query falls outside of the available data, LLMs may resort to hallucinating or using their implicit knowledge—often resulting in inaccurate or incomplete responses.

To mitigate this risk, we introduced an additional verification step. Before answering a user query, we let the model evaluate the relevance of each retrieved chunk from the knowledge base. If none of the retrieved data can reasonably answer the query, the model is instructed not to proceed. This strategy acts as a guardrail, preventing unsupported or factually incorrect answers and ensuring that only relevant, grounded information is used. This method is particularly effective for maintaining the integrity of responses when the knowledge base lacks information on certain topics.
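
A sketch of such a guardrail check; the yes/no prompt is an illustrative simplification of the actual evaluation step.

    from openai import OpenAI

    client = OpenAI()

    def is_answerable(query: str, chunks: list) -> bool:
        """Let the model judge whether the retrieved chunks can answer the query."""
        joined = "\n---\n".join(chunks)
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content":
                f"Excerpts:\n{joined}\n\nQuestion: {query}\n"
                "Can the question be answered from the excerpts alone? "
                "Reply with exactly YES or NO."}],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    # If is_answerable(...) returns False, the system declines to answer
    # instead of letting the model fall back on its implicit knowledge.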

3. Handling Industry-Specific Terminology during Transcription

Another critical observation was the difficulty STT models had when dealing with industry-specific terms. These terms, which often include company names, technical jargon, machine specifications, and codes, are essential for accurate retrieval and transcription. Unfortunately, they are frequently misunderstood or transcribed incorrectly, which can lead to ineffective searches or responses.

To address this issue, we created a curated collection of industry-specific terms relevant to our use case. By incorporating these terms into the model’s prompts, we were able to significantly improve the transcription quality and the accuracy of responses. For instance, OpenAI’s Whisper model supports the inclusion of domain-specific terminology, allowing us to guide the transcription process more effectively and ensure that key technical details were preserved.
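
For example, with OpenAI's Whisper API the curated terms can be passed via the transcription prompt parameter; the file name and the term list below are made-up examples.

    from openai import OpenAI

    client = OpenAI()
    DOMAIN_TERMS = "scieneers, OPC UA, Siemens S7-1500, bearing block FAG 22218"

    with open("maintenance_tutorial.mp3", "rb") as audio:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio,
            prompt=DOMAIN_TERMS,  # biases recognition toward these spellings
        )
    print(transcript.text)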

Conclusion

VideoRAG represents the next step in leveraging generative AI for knowledge transfer, particularly in industries where hands-on tasks require more than just text to explain. By combining multi-modal models and RAG techniques, companies can preserve and share both explicit and implicit knowledge effectively across generations of employees.

Arne Grobrügge

Arne Grobruegge, Data Scientist at scieneers GmbH
arne.grobruegge@scieneers.de

Azure Data Platform Proof of Concept

Hands-on implementation of your first use case on the Microsoft Azure Data Platform with your own data, including an end-to-end pass through all components and logical layers according to best practices.

Microsoft Data Strategy & Analytics Assessment

Holistic conception of the optimal data platform based on your requirements using Microsoft technologies

Data Platform on Azure – but secure, please!

Many companies use Azure Data Services for data management and analytics tasks or are planning a migration to Microsoft’s cloud platform. There are many reasons for this, such as operating costs and scalability, but what about security?
We will look at which mechanisms can be used on Azure for secure data exchange, how data access can be configured from the service level down to the data level for end users, and what options there are for connecting the services securely at the network level while sealing them off from unwanted visitors.

Cooperation with Intel®: Quantization of ML-Models and Performance-Boost in Pre-Processing

scieneers are AI Specialist Partner of the semiconductor manufacturer Intel®. We test in real-world deployment scenarios how Intel’s latest technologies and tools can further enhance the performance of analytical models and computations on large data sets.

Power BI Solution

The Power BI platform is a central component of our technology stack when it comes to processing and visualizing relevant information in an efficient, up-to-date and comprehensible way. Use this Solutions Showcase to get to know the possibilities of the platform and to see our expertise for yourself.

Azure Data Platform Advanced Security

Data is valuable.
Data needs security!
In our free briefing, you will develop a deeper understanding of security topics and best practices on Azure, especially in the context of the data platform.
Together we will run hands-on checks of your data platform against well-proven security techniques and best practices. The outcome will be a list of well-founded recommendations.

Power BI Training

We want to help our customers fully realize the potential of their enterprise data with Power BI!
Basics for beginners and individual modules for advanced users are available, as well as consulting for your individual data challenge.