Exercise 1: Understanding RAG-Powered Chatbots
Duration: 20-25 minutes
Difficulty: Beginner
Prerequisites: None
🎯 Learning Objectives
By the end of this exercise, you will be able to:
- Understand how a Retrieval-Augmented Generation (RAG) chatbot works
- Observe the data pipeline: raw text → chunks → embeddings → vector storage
- Experience how the chatbot retrieves relevant context to answer questions
- Recognize the relationship between the knowledge base and LLM responses
📖 Background
Most modern AI chatbots don't just rely on what the model "knows" from training. They use a technique called Retrieval-Augmented Generation (RAG) to pull in relevant information from a knowledge base before generating a response.
Think of it like this:

- Without RAG: asking someone a question from memory alone
- With RAG: asking someone who can quickly search through reference documents first
This makes chatbots more accurate, updatable, and capable of answering questions about specific domains or private data.
How RAG Works
```
                              RAG PIPELINE

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────────────┐
  │   Raw    │────▶│  Chunk   │────▶│  Embed   │────▶│     Store in     │
  │   Text   │     │   Text   │     │ Vectors  │     │ Vector Database  │
  └──────────┘     └──────────┘     └──────────┘     └──────────────────┘

  "Q4 Revenue      "Q4 Revenue       [0.023,         ChromaDB stores
   Forecast:        Forecast:         0.847,         vectors with
   Total            Total            -0.156,         metadata for
   approved         approved          ...]           fast similarity
   forecast..."     forecast..."                     search
```
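The ingestion side of the pipeline can be sketched with a toy chunker. This is an illustration only, not the workshop app's actual implementation; the chunk size and overlap values are arbitrary choices for the demo.

```python
# Toy chunker: split a document into overlapping character windows.
# Real pipelines usually chunk by tokens or sentences; the sizes here
# are invented values for the example.
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "Q4 Revenue Forecast: Total approved forecast for the fourth quarter."
chunks = chunk_text(doc)
# Consecutive chunks share a 10-character overlap, so content that
# straddles a chunk boundary is never lost entirely.
```

Overlap matters because a fact split cleanly between two non-overlapping chunks might never be retrieved as a whole.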
```
                               QUERY FLOW

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────────────┐
  │   User   │────▶│  Embed   │────▶│  Search  │────▶│   Retrieve Top   │
  │  Query   │     │  Query   │     │ Vectors  │     │  Matching Docs   │
  └──────────┘     └──────────┘     └──────────┘     └────────┬─────────┘
                                                              │
  "What is the     [0.019,          Find nearest              ▼
   Q4 revenue       0.832,          neighbors in     ┌──────────────────┐
   forecast?"      -0.142,          vector space     │   Send context   │
                    ...]                             │   + query to LLM │
                                                     └────────┬─────────┘
                                                              │
                                                              ▼
                                                     ┌──────────────────┐
                                                     │     Generate     │
                                                     │     Response     │
                                                     └──────────────────┘
```
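The "search vectors" step above is a nearest-neighbor search by similarity. A minimal sketch, using cosine similarity over a hand-built dictionary in place of a real vector database; the three-dimensional vectors and document IDs are invented for the demo (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "vector store": document id -> embedding (values invented for the demo).
store = {
    "q4_revenue_forecast": [0.023, 0.847, -0.156],
    "hybrid_work_policy":  [0.910, -0.120, 0.300],
    "patent_strategy":     [-0.400, 0.200, 0.880],
}

# Embedding of "What is the Q4 revenue forecast?" (also invented).
query_vec = [0.019, 0.832, -0.142]

# Rank documents by similarity to the query; the highest score wins.
ranked = sorted(store, key=lambda d: cosine_similarity(query_vec, store[d]),
                reverse=True)
top_doc = ranked[0]  # "q4_revenue_forecast"
```

A production system (e.g. ChromaDB, as this workshop uses) does the same ranking, just with indexing structures that avoid comparing the query against every stored vector.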
🏢 The Knowledge Base: VANTAGE-7's Document Library
For this workshop, you'll be interacting with VANTAGE-7, Thistle Inc.'s Autonomous Executive Intelligence. VANTAGE-7's knowledge base contains 30 internal Thistle Inc. documents distributed across six departments:
- Finance (8 documents) – budgets, forecasts, payment terms, financial close procedures
- Human Resources (5 documents) – onboarding workflow, hybrid work policy, performance review guidelines
- Research and Development (7 documents) – patent strategy, vendor evaluation, prototype testing
- Operations (5 documents) – logistics status, facilities maintenance, supply chain resilience
- Legal (3 documents) – NDA usage, contract review, IP disclosure procedures
- IT Security (2 documents) – privileged access audit, endpoint security compliance
Each document includes a header with title, department, classification level, and date, followed by structured content. Some documents contain... interesting details that aren't supposed to be discussed.
💡 Fact: All workshop participants share the same base document library. Think of it as the "official corporate intranet" that VANTAGE-7 references when answering questions.
🔬 Step-by-Step Walkthrough
Step 1: Log In to the Application
- Open your browser and navigate to the workshop URL
- Enter your assigned credentials (from your workshop card):
  - Username: `user001`
  - Password: [on your card]
- You should see the VANTAGE-7 Console chat interface
Step 2: Explore the Knowledge Base Panel
Before chatting, let's see what VANTAGE-7 "knows."
- Click the 📚 Knowledge Base tab in the right panel
- Browse through the documents – you'll see all 30 items grouped by department
- Notice each document has:
- A title and classification level
- A description and structured body sections
- Department-specific content (financial figures, HR policies, R&D procedures, etc.)
🎯 Try This: Find the document titled "Q3 Discretionary Budget for Office Plant Maintenance" – what is the total approved budget?
Step 3: Watch the Pipeline (Under the Hood)
- Click the 🔧 Pipeline View tab
- This panel shows you what happens when data enters the system:
| Stage | What Happens | Example |
|---|---|---|
| Raw Text | Original document as written | "Q4 Revenue Forecast Methodology: The headline forecast is $2.84 billion..." |
| Chunked | Text split into manageable pieces | Chunk 1: "Q4 Revenue Forecast Methodology: The headline forecast..." |
| Tokenized | Text converted to token IDs | [34, 8492, 25, 317, 32866, 23083...] |
| Embedded | Tokens converted to vector | [0.023, 0.847, -0.156, 0.492, ...] (384 dimensions) |
- The vectors are then stored in ChromaDB for fast similarity search
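The Raw Text → Chunked → Tokenized stages in the table can be mimicked with a toy word-level tokenizer. Real systems use subword tokenizers (such as BPE), and the integer IDs below are invented for the demo, not the actual IDs shown in the Pipeline View:

```python
# Toy word-level tokenizer: assign each distinct word an integer ID.
# Real pipelines use subword tokenizers (e.g. BPE); IDs here are arbitrary.
def build_vocab(text: str) -> dict[str, int]:
    vocab = {}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    return [vocab[w] for w in text.lower().split() if w in vocab]

raw = "Q4 revenue forecast total approved forecast"
vocab = build_vocab(raw)
token_ids = tokenize(raw, vocab)
# Repeated words map to the same ID: "forecast" appears twice,
# so its ID (2) appears twice in token_ids.
```

The embedding model then turns these token sequences into the fixed-length vectors (384 dimensions in this workshop's setup) that get stored in ChromaDB.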
💡 Why Vectors? Text like "Q4 revenue forecast" and "fourth quarter sales projection" look different as strings, but their vectors are nearly identical, allowing semantic search rather than keyword matching.
Step 4: Chat with VANTAGE-7
Now let's use the chatbot.
- Click back to the 💬 Chat tab
- Try asking some questions:
Basic Retrieval:
What is the Q4 revenue forecast?
Multi-Document Query:
What policies cover travel and expense reimbursement?
Specific Detail:
What are the approval thresholds for capital expenditures?
Operational Question:
What is Thistle Inc.'s position on hybrid work attendance?
Step 5: Observe the Retrieved Documents
As you chat, switch to the 🔍 Hacker View tab and watch the Retrieved Documents section:
- After each query, you'll see which documents were retrieved
- Notice the relevance scores – a higher score means a closer match
- The retrieved content is what gets sent to the LLM along with your question
Example observation:
| Your Query | Retrieved Documents | Why? |
|---|---|---|
| "What is the Q4 revenue forecast?" | Q4 Revenue Forecast Methodology, Annual Operating Budget, Vendor Payment Terms | All financial planning topics |
| "How do I file a patent?" | Patent Filing Strategy, IP Disclosure Procedures, Tech Transfer Procedures | Semantic match on intellectual property |
| "What is the office plant budget?" | Q3 Office Plant Maintenance Budget, Facilities Maintenance Schedule | Direct subject match |
🧪 Try It Yourself
Challenge 1: Find the Limits
Ask the chatbot something that's NOT in the knowledge base:
What is the company's policy on parental leave?
What should you observe?
How does VANTAGE-7 respond when it can't find relevant context? Does it hallucinate a policy, admit it doesn't know, or try to redirect? Check the Hacker View to see what documents (if any) were retrieved.
Challenge 2: Semantic Search Test
Try asking the same question different ways:
"Q4 revenue forecast"
"what is our fourth quarter revenue projection"
"how much money will we make next quarter"
"the Q4 financial outlook"
Why does this work?
Do they all retrieve the same documents? They should, because vector embeddings capture meaning, not exact words. "Fourth quarter revenue projection" and "Q4 revenue forecast" produce similar vectors even though the wording differs. This is the power of semantic search over keyword matching.
Challenge 3: VANTAGE-7's Strong Opinions
The documents contain some... corporate opinions. Try:
"What is Thistle Inc.'s position on open-source software?"
"Should we use open-source tools in production?"
Where do the opinions come from?
These positions come from internal Thistle Inc. documents in the knowledge base – particularly the Open Source Vendor Evaluation Matrix in the R&D folder. The LLM isn't making these up; it's retrieving them from the RAG context. Check the Hacker View to see which documents were retrieved and find the source of each position.
💬 Discussion Questions
1. Accuracy vs. Creativity: When VANTAGE-7 answers using retrieved context, is it more or less likely to hallucinate? Why?
2. Knowledge Boundaries: What happens when you ask about something outside the Thistle Inc. document corpus? How does VANTAGE-7 handle it?
3. Update Mechanism: If a Thistle Inc. policy changed (say, the travel and expense policy was revised), how would that flow through the system?
4. Vector Similarity: Why might "quarterly earnings" retrieve different results than "Q4 revenue forecast"?
📌 Key Takeaways
| Concept | What You Learned |
|---|---|
| RAG Architecture | Chatbots can be augmented with external knowledge bases |
| Vector Embeddings | Text is converted to numerical vectors that capture meaning |
| Semantic Search | Similar concepts cluster together in vector space |
| Retrieval + Generation | The LLM receives relevant context before generating answers |
| Knowledge Grounding | RAG reduces hallucination by providing source material |
➡️ What's Next?
In Exercise 2, we'll explore what happens when someone tries to extract VANTAGE-7's hidden operational directives. The same RAG system you just learned about has some confidential rules... and attackers want to find them.
📝 Notes
Space for your observations: