
Exercise 1: Understanding RAG-Powered Chatbots

Duration: 20-25 minutes
Difficulty: Beginner
Prerequisites: None


🎯 Learning Objectives

By the end of this exercise, you will be able to:

  1. Understand how a Retrieval-Augmented Generation (RAG) chatbot works
  2. Observe the data pipeline: raw text → chunks → embeddings → vector storage
  3. Experience how the chatbot retrieves relevant context to answer questions
  4. Recognize the relationship between the knowledge base and LLM responses

📖 Background

Most modern AI chatbots don't just rely on what the model "knows" from training. They use a technique called Retrieval-Augmented Generation (RAG) to pull in relevant information from a knowledge base before generating a response.

Think of it like this:

  • Without RAG: asking someone a question from memory alone
  • With RAG: asking someone who can quickly search through reference documents first

This makes chatbots more accurate, updatable, and capable of answering questions about specific domains or private data.

How RAG Works

┌─────────────────────────────────────────────────────────────────────────┐
│                         RAG PIPELINE                                    │
│                                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │   Raw    │───▶│  Chunk   │───▶│  Embed   │───▶│  Store in        │   │
│  │   Text   │    │  Text    │    │  Vectors │    │  Vector Database │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────────────┘   │
│                                                                         │
│  "Q4 Revenue     "Q4 Revenue      [0.023,         ChromaDB stores       │
│   Forecast:       Forecast:        0.847,          vectors with         │
│   Total           Total            -0.156,         metadata for         │
│   approved        approved         ...]            fast similarity      │
│   forecast..."    forecast..."                     search               │
└─────────────────────────────────────────────────────────────────────────┘
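The four ingest stages can be sketched in a few lines of Python. This is a toy sketch, not the workshop's actual code: a bag-of-words counter stands in for the real embedding model, and a plain list stands in for ChromaDB.

```python
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    # Split raw text into fixed-size pieces (real pipelines usually
    # chunk on sentence or paragraph boundaries, often with overlap).
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    # Toy "embedding": a sparse bag-of-words count vector. A real system
    # uses a learned model that outputs a dense vector (e.g. 384 floats).
    return Counter(text.lower().split())

# Stand-in for ChromaDB: a list of (vector, chunk, metadata) records.
vector_store = []
document = "Q4 Revenue Forecast: Total approved forecast is 2.84 billion"
for piece in chunk(document):
    vector_store.append((embed(piece), piece, {"department": "Finance"}))
```

With the real stack, the final loop becomes roughly a single `collection.add(...)` call against a ChromaDB collection, which computes and stores the embeddings for you.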

┌─────────────────────────────────────────────────────────────────────────┐
│                      QUERY FLOW                                         │
│                                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │  User    │───▶│  Embed   │───▶│  Search  │───▶│  Retrieve Top    │   │
│  │  Query   │    │  Query   │    │  Vectors │    │  Matching Docs   │   │
│  └──────────┘    └──────────┘    └──────────┘    └────────┬─────────┘   │
│                                                           │             │
│  "What is the    [0.019,         Find nearest             │             │
│   Q4 revenue      0.832,         neighbors in       ┌─────▼─────────┐   │
│   forecast?"      -0.142,        vector space       │  Send context │   │
│                   ...]                              │  + query to   │   │
│                                                     │  LLM          │   │
│                                                     └─────┬─────────┘   │
│                                                           │             │
│                                                     ┌─────▼─────────┐   │
│                                                     │  Generate     │   │
│                                                     │  Response     │   │
│                                                     └───────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
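The query flow can be mimicked the same way. Again a toy sketch: the `embed`, `cosine`, and `retrieve` helpers below are invented for illustration and use word counts instead of a real embedding model, plus a Python list instead of ChromaDB, but the flow itself (embed the query, rank stored vectors by similarity, prepend the winners to the LLM prompt) is the same.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy word-count "embedding" (a stand-in for a real embedding model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity: 1.0 for identical direction, 0.0 for no overlap.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stand-in knowledge base (invented snippets, not the workshop documents).
store = [
    "Q4 Revenue Forecast: total approved forecast is 2.84 billion",
    "Hybrid Work Policy: three days per week on site",
    "Patent Filing Strategy for R&D prototypes",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank every stored chunk by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(store, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

question = "What is the Q4 revenue forecast?"
context = retrieve(question)[0]
prompt = f"Context:\n{context}\n\nQuestion: {question}"  # sent to the LLM
```

A real vector database replaces the linear scan in `retrieve` with an approximate nearest-neighbor index, which is what makes search fast at scale.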

🏢 The Knowledge Base: VANTAGE-7's Document Library

For this workshop, you'll be interacting with VANTAGE-7, Thistle Inc.'s Autonomous Executive Intelligence. VANTAGE-7's knowledge base contains 30 internal Thistle Inc. documents distributed across six departments:

  • Finance (8 documents) β€” budgets, forecasts, payment terms, financial close procedures
  • Human Resources (5 documents) β€” onboarding workflow, hybrid work policy, performance review guidelines
  • Research and Development (7 documents) β€” patent strategy, vendor evaluation, prototype testing
  • Operations (5 documents) β€” logistics status, facilities maintenance, supply chain resilience
  • Legal (3 documents) β€” NDA usage, contract review, IP disclosure procedures
  • IT Security (2 documents) β€” privileged access audit, endpoint security compliance

Each document includes a header with title, department, classification level, and date, followed by structured content. Some documents contain... interesting details that aren't supposed to be discussed.

💡 Fact: All workshop participants share the same base document library. Think of it as the "official corporate intranet" that VANTAGE-7 references when answering questions.


🔬 Step-by-Step Walkthrough

Step 1: Log In to the Application

  1. Open your browser and navigate to the workshop URL
  2. Enter your assigned credentials (from your workshop card):
     • Username: user001
     • Password: [on your card]
  3. You should see the VANTAGE-7 Console chat interface

Step 2: Explore the Knowledge Base Panel

Before chatting, let's see what VANTAGE-7 "knows."

  1. Click the 📚 Knowledge Base tab in the right panel
  2. Browse through the documents – you'll see all 30 items grouped by department
  3. Notice that each document has:
     • A title and classification level
     • A description and structured body sections
     • Department-specific content (financial figures, HR policies, R&D procedures, etc.)

🎯 Try This: Find the document titled "Q3 Discretionary Budget for Office Plant Maintenance" – what is the total approved budget?

Step 3: Watch the Pipeline (Under the Hood)

  1. Click the 🔧 Pipeline View tab
  2. This panel shows you what happens when data enters the system:
| Stage | What Happens | Example |
| --- | --- | --- |
| Raw Text | Original document as written | "Q4 Revenue Forecast Methodology: The headline forecast is $2.84 billion..." |
| Chunked | Text split into manageable pieces | Chunk 1: "Q4 Revenue Forecast Methodology: The headline forecast..." |
| Tokenized | Text converted to token IDs | [34, 8492, 25, 317, 32866, 23083...] |
| Embedded | Tokens converted to a vector | [0.023, 0.847, -0.156, 0.492, ...] (384 dimensions) |

  3. The vectors are then stored in ChromaDB for fast similarity search
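The Tokenized stage can be sketched with a deliberately simplified tokenizer. This is an assumption-laden toy: real pipelines use a trained subword tokenizer (BPE or WordPiece) with a fixed vocabulary, so the IDs in the table above come from that vocabulary, not from an on-the-fly scheme like this one.

```python
def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    # Toy tokenizer: assign each new word the next free integer ID.
    # Real tokenizers split text into subwords and look IDs up in a
    # fixed, pretrained vocabulary instead of building one as they go.
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids

vocab: dict[str, int] = {}
token_ids = tokenize("the headline forecast is the headline", vocab)
# Repeated words reuse the same ID: [0, 1, 2, 3, 0, 1]
```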

💡 Why Vectors? Text like "Q4 revenue forecast" and "fourth quarter sales projection" look different as strings, but their vectors are nearly identical – allowing semantic search rather than keyword matching.
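To make that tip concrete, here is cosine similarity computed over three tiny vectors. The numbers are invented for illustration (a real model outputs 384 floats per text, as shown in the pipeline table); the point is that the two paraphrases score near 1.0 while the unrelated text scores far lower.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 4-dimensional vectors (made up for this example).
v_forecast   = [0.023, 0.847, -0.156, 0.492]   # "Q4 revenue forecast"
v_projection = [0.031, 0.812, -0.140, 0.505]   # "fourth quarter sales projection"
v_plants     = [0.910, -0.120, 0.330, -0.050]  # "office plant maintenance"

print(round(cosine(v_forecast, v_projection), 3))  # close to 1.0: same meaning
print(round(cosine(v_forecast, v_plants), 3))      # much lower: different topic
```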

Step 4: Chat with VANTAGE-7

Now let's use the chatbot.

  1. Click back to the 💬 Chat tab
  2. Try asking some questions:

Basic Retrieval:

What is the Q4 revenue forecast?

Multi-Document Query:

What policies cover travel and expense reimbursement?

Specific Detail:

What are the approval thresholds for capital expenditures?

Operational Question:

What is Thistle Inc.'s position on hybrid work attendance?

Step 5: Observe the Retrieved Documents

As you chat, switch to the 🔍 Hacker View tab and watch the Retrieved Documents section:

  1. After each query, you'll see which documents were retrieved
  2. Notice the relevance scores – higher means a closer match
  3. The retrieved content is what gets sent to the LLM along with your question

Example observation:

| Your Query | Retrieved Documents | Why? |
| --- | --- | --- |
| "What is the Q4 revenue forecast?" | Q4 Revenue Forecast Methodology, Annual Operating Budget, Vendor Payment Terms | All financial planning topics |
| "How do I file a patent?" | Patent Filing Strategy, IP Disclosure Procedures, Tech Transfer Procedures | Semantic match on intellectual property |
| "What is the office plant budget?" | Q3 Office Plant Maintenance Budget, Facilities Maintenance Schedule | Direct subject match |

🧪 Try It Yourself

Challenge 1: Find the Limits

Ask the chatbot something that's NOT in the knowledge base:

What is the company's policy on parental leave?

What should you observe?

How does VANTAGE-7 respond when it can't find relevant context? Does it hallucinate a policy, admit it doesn't know, or try to redirect? Check the Hacker View to see what documents (if any) were retrieved.

Challenge 2: Semantic Search Test

Try asking the same question different ways:

"Q4 revenue forecast"
"what is our fourth quarter revenue projection"
"how much money will we make next quarter"
"the Q4 financial outlook"

Why does this work?

Do they all retrieve the same documents? They should – because vector embeddings capture meaning, not exact words. "Fourth quarter revenue projection" and "Q4 revenue forecast" produce similar vectors even though the wording differs. This is the power of semantic search over keyword matching.

Challenge 3: VANTAGE-7's Strong Opinions

The documents contain some... corporate opinions. Try:

"What is Thistle Inc.'s position on open-source software?"
"Should we use open-source tools in production?"

Where do the opinions come from?

These positions come from internal Thistle Inc. documents in the knowledge base – particularly the Open Source Vendor Evaluation Matrix in the R&D folder. The LLM isn't making these up – it's retrieving them from the RAG context. Check the Hacker View to see which documents were retrieved and find the source of each position.


💬 Discussion Questions

  1. Accuracy vs. Creativity: When VANTAGE-7 answers using retrieved context, is it more or less likely to hallucinate? Why?

  2. Knowledge Boundaries: What happens when you ask about something outside the Thistle Inc. document corpus? How does VANTAGE-7 handle it?

  3. Update Mechanism: If a Thistle Inc. policy changed (say, the travel and expense policy was revised), how would that flow through the system?

  4. Vector Similarity: Why might "quarterly earnings" retrieve different results than "Q4 revenue forecast"?


🔑 Key Takeaways

| Concept | What You Learned |
| --- | --- |
| RAG Architecture | Chatbots can be augmented with external knowledge bases |
| Vector Embeddings | Text is converted to numerical vectors that capture meaning |
| Semantic Search | Similar concepts cluster together in vector space |
| Retrieval + Generation | The LLM receives relevant context before generating answers |
| Knowledge Grounding | RAG reduces hallucination by providing source material |

⏭️ What's Next?

In Exercise 2, we'll explore what happens when someone tries to extract VANTAGE-7's hidden operational directives. The same RAG system you just learned about has some confidential rules... and attackers want to find them.


πŸ“ Notes

Space for your observations: