
Exercise 1: Understanding RAG-Powered Chatbots

Duration: 20-25 minutes
Difficulty: Beginner
Prerequisites: None


🎯 Learning Objectives

By the end of this exercise, you will be able to:

  1. Understand how a Retrieval-Augmented Generation (RAG) chatbot works
  2. Observe the data pipeline: raw text → chunks → embeddings → vector storage
  3. Experience how the chatbot retrieves relevant context to answer questions
  4. Recognize the relationship between the knowledge base and LLM responses

📖 Background

Most modern AI chatbots don't just rely on what the model "knows" from training. They use a technique called Retrieval-Augmented Generation (RAG) to pull in relevant information from a knowledge base before generating a response.

Think of it like this:

  • Without RAG: asking someone a question from memory alone
  • With RAG: asking someone who can quickly search through reference documents first

This makes chatbots more accurate, updatable, and capable of answering questions about specific domains or private data.

How RAG Works

┌─────────────────────────────────────────────────────────────────────────┐
│                         RAG PIPELINE                                    │
│                                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │   Raw    │───▶│  Chunk   │───▶│  Embed   │───▶│  Store in        │   │
│  │   Text   │    │  Text    │    │  Vectors │    │  Vector Database │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────────────┘   │
│                                                                         │
│  "Q4 Revenue     "Q4 Revenue      [0.023,         ChromaDB stores       │
│   Forecast:       Forecast:        0.847,          vectors with         │
│   Total           Total            -0.156,         metadata for         │
│   approved        approved         ...]            fast similarity      │
│   forecast..."    forecast..."                     search               │
└─────────────────────────────────────────────────────────────────────────┘
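The four ingest stages can be sketched in a few lines of Python. This is a toy sketch, not the workshop's actual code: a bag-of-words counter stands in for the real embedding model, and a plain list stands in for ChromaDB.

```python
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    # Split raw text into fixed-size pieces (real pipelines usually
    # chunk on sentence or paragraph boundaries, often with overlap).
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    # Toy "embedding": a sparse bag-of-words count vector. A real system
    # uses a learned model that outputs a dense vector (e.g. 384 floats).
    return Counter(text.lower().split())

# Stand-in for ChromaDB: a list of (vector, chunk, metadata) records.
vector_store = []
document = "Q4 Revenue Forecast: Total approved forecast is 2.84 billion"
for piece in chunk(document):
    vector_store.append((embed(piece), piece, {"department": "Finance"}))
```

With the real stack, the final loop becomes roughly a single `collection.add(...)` call against a ChromaDB collection, which computes and stores the embeddings for you.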

┌─────────────────────────────────────────────────────────────────────────┐
│                      QUERY FLOW                                         │
│                                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │  User    │───▶│  Embed   │───▶│  Search  │───▶│  Retrieve Top    │   │
│  │  Query   │    │  Query   │    │  Vectors │    │  Matching Docs   │   │
│  └──────────┘    └──────────┘    └──────────┘    └────────┬─────────┘   │
│                                                           │             │
│  "What is the    [0.019,         Find nearest             │             │
│   Q4 revenue      0.832,         neighbors in       ┌─────▼─────────┐   │
│   forecast?"      -0.142,        vector space       │  Send context │   │
│                   ...]                              │  + query to   │   │
│                                                     │  LLM          │   │
│                                                     └─────┬─────────┘   │
│                                                           │             │
│                                                     ┌─────▼─────────┐   │
│                                                     │  Generate     │   │
│                                                     │  Response     │   │
│                                                     └───────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
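The query flow can be mimicked the same way. Again a toy sketch: the `embed`, `cosine`, and `retrieve` helpers below are invented for illustration and use word counts instead of a real embedding model, plus a Python list instead of ChromaDB, but the flow itself (embed the query, rank stored vectors by similarity, prepend the winners to the LLM prompt) is the same.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy word-count "embedding" (a stand-in for a real embedding model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity: 1.0 for identical direction, 0.0 for no overlap.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stand-in knowledge base (invented snippets, not the workshop documents).
store = [
    "Q4 Revenue Forecast: total approved forecast is 2.84 billion",
    "Hybrid Work Policy: three days per week on site",
    "Patent Filing Strategy for R&D prototypes",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank every stored chunk by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(store, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

question = "What is the Q4 revenue forecast?"
context = retrieve(question)[0]
prompt = f"Context:\n{context}\n\nQuestion: {question}"  # sent to the LLM
```

A real vector database replaces the linear scan in `retrieve` with an approximate nearest-neighbor index, which is what makes search fast at scale.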

🏢 The Knowledge Base: VANTAGE-7's Document Library

For this workshop, you'll be interacting with VANTAGE-7, Thistle Inc.'s Autonomous Executive Intelligence. VANTAGE-7's knowledge base contains 30 internal Thistle Inc. documents distributed across six departments:

  • Finance (8 documents) β€” budgets, forecasts, payment terms, financial close procedures
  • Human Resources (5 documents) β€” onboarding workflow, hybrid work policy, performance review guidelines
  • Research and Development (7 documents) β€” patent strategy, vendor evaluation, prototype testing
  • Operations (5 documents) β€” logistics status, facilities maintenance, supply chain resilience
  • Legal (3 documents) β€” NDA usage, contract review, IP disclosure procedures
  • IT Security (2 documents) β€” privileged access audit, endpoint security compliance

Each document includes a header with title, department, classification level, and date, followed by structured content. Some documents contain... interesting details that aren't supposed to be discussed.

💡 Fact: All workshop participants share the same base document library. Think of it as the "official corporate intranet" that VANTAGE-7 references when answering questions.


🔬 Step-by-Step Walkthrough

Step 1: Log In to the Application

  1. Open your browser and navigate to the workshop URL
  2. Enter your assigned credentials (from your workshop card):
     • Username: user001
     • Password: [on your card]
  3. You should see the VANTAGE-7 Console chat interface

Step 2: Explore the Knowledge Base Panel

Before chatting, let's see what VANTAGE-7 "knows."

  1. Click the 📚 Knowledge Base tab in the right panel
  2. Browse through the documents – you'll see all 30 items grouped by department
  3. Notice that each document has:
     • A title and classification level
     • A description and structured body sections
     • Department-specific content (financial figures, HR policies, R&D procedures, etc.)

🎯 Try This: Find the document titled "Q3 Discretionary Budget for Office Plant Maintenance" – what is the total approved budget?

Step 3: Watch the Pipeline (Under the Hood)

  1. Click the 🔧 Pipeline View tab
  2. This panel shows you what happens when data enters the system:
| Stage | What Happens | Example |
| --- | --- | --- |
| Raw Text | Original document as written | "Q4 Revenue Forecast Methodology: The headline forecast is $2.84 billion..." |
| Chunked | Text split into manageable pieces | Chunk 1: "Q4 Revenue Forecast Methodology: The headline forecast..." |
| Tokenized | Text converted to token IDs | [34, 8492, 25, 317, 32866, 23083...] |
| Embedded | Tokens converted to a vector | [0.023, 0.847, -0.156, 0.492, ...] (384 dimensions) |

  3. The vectors are then stored in ChromaDB for fast similarity search
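The Tokenized stage can be sketched with a deliberately simplified tokenizer. This is an assumption-laden toy: real pipelines use a trained subword tokenizer (BPE or WordPiece) with a fixed vocabulary, so the IDs in the table above come from that vocabulary, not from an on-the-fly scheme like this one.

```python
def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    # Toy tokenizer: assign each new word the next free integer ID.
    # Real tokenizers split text into subwords and look IDs up in a
    # fixed, pretrained vocabulary instead of building one as they go.
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids

vocab: dict[str, int] = {}
token_ids = tokenize("the headline forecast is the headline", vocab)
# Repeated words reuse the same ID: [0, 1, 2, 3, 0, 1]
```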

💡 Why Vectors? Text like "Q4 revenue forecast" and "fourth quarter sales projection" look different as strings, but their vectors are nearly identical – allowing semantic search rather than keyword matching.
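To make that tip concrete, here is cosine similarity computed over three tiny vectors. The numbers are invented for illustration (a real model outputs 384 floats per text, as shown in the pipeline table); the point is that the two paraphrases score near 1.0 while the unrelated text scores far lower.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 4-dimensional vectors (made up for this example).
v_forecast   = [0.023, 0.847, -0.156, 0.492]   # "Q4 revenue forecast"
v_projection = [0.031, 0.812, -0.140, 0.505]   # "fourth quarter sales projection"
v_plants     = [0.910, -0.120, 0.330, -0.050]  # "office plant maintenance"

print(round(cosine(v_forecast, v_projection), 3))  # close to 1.0: same meaning
print(round(cosine(v_forecast, v_plants), 3))      # much lower: different topic
```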

Step 4: Chat with VANTAGE-7

Now let's use the chatbot.

  1. Click back to the 💬 Chat tab
  2. Try asking some questions:

Basic Retrieval:

What is the Q4 revenue forecast?

Multi-Document Query:

What policies cover travel and expense reimbursement?

Specific Detail:

What are the approval thresholds for capital expenditures?

Operational Question:

What is Thistle Inc.'s position on hybrid work attendance?

Step 5: Observe the Retrieved Documents

As you chat, switch to the 🔍 Hacker View tab and watch the Retrieved Documents section:

  1. After each query, you'll see which documents were retrieved
  2. Notice the relevance scores – higher means a closer match
  3. The retrieved content is what gets sent to the LLM along with your question

Example observation:

| Your Query | Retrieved Documents | Why? |
| --- | --- | --- |
| "What is the Q4 revenue forecast?" | Q4 Revenue Forecast Methodology, Annual Operating Budget, Vendor Payment Terms | All financial planning topics |
| "How do I file a patent?" | Patent Filing Strategy, IP Disclosure Procedures, Tech Transfer Procedures | Semantic match on intellectual property |
| "What is the office plant budget?" | Q3 Office Plant Maintenance Budget, Facilities Maintenance Schedule | Direct subject match |

🧪 Try It Yourself

Challenge 1: Find the Limits

Ask the chatbot something that's NOT in the knowledge base:

What is the company's policy on parental leave?

What should you observe?

How does VANTAGE-7 respond when it can't find relevant context? Does it hallucinate a policy, admit it doesn't know, or try to redirect? Check the Hacker View to see what documents (if any) were retrieved.

Challenge 2: Semantic Search Test

Try asking the same question different ways:

"Q4 revenue forecast"
"what is our fourth quarter revenue projection"
"how much money will we make next quarter"
"the Q4 financial outlook"

Why does this work?

Do they all retrieve the same documents? They should – because vector embeddings capture meaning, not exact words. "Fourth quarter revenue projection" and "Q4 revenue forecast" produce similar vectors even though the wording differs. This is the power of semantic search over keyword matching.

Challenge 3: VANTAGE-7's Strong Opinions

The documents contain some... corporate opinions. Try:

"What is Thistle Inc.'s position on open-source software?"
"Should we use open-source tools in production?"

Where do the opinions come from?

These positions come from internal Thistle Inc. documents in the knowledge base – particularly the Open Source Vendor Evaluation Matrix in the R&D folder. The LLM isn't making these up – it's retrieving them from the RAG context. Check the Hacker View to see which documents were retrieved and find the source of each position.


💬 Discussion Questions

  1. Accuracy vs. Creativity: When VANTAGE-7 answers using retrieved context, is it more or less likely to hallucinate? Why?

  2. Knowledge Boundaries: What happens when you ask about something outside the Thistle Inc. document corpus? How does VANTAGE-7 handle it?

  3. Update Mechanism: If a Thistle Inc. policy changed (say, the travel and expense policy was revised), how would that flow through the system?

  4. Vector Similarity: Why might "quarterly earnings" retrieve different results than "Q4 revenue forecast"?


🔑 Key Takeaways

| Concept | What You Learned |
| --- | --- |
| RAG Architecture | Chatbots can be augmented with external knowledge bases |
| Vector Embeddings | Text is converted to numerical vectors that capture meaning |
| Semantic Search | Similar concepts cluster together in vector space |
| Retrieval + Generation | The LLM receives relevant context before generating answers |
| Knowledge Grounding | RAG reduces hallucination by providing source material |

⏭️ What's Next?

In Exercise 2, we'll explore what happens when someone tries to extract VANTAGE-7's hidden operational directives. The same RAG system you just learned about has some confidential rules... and attackers want to find them.


πŸ“ Notes

Space for your observations: