BSAN 765 Reflection Essay: A Semester in Generative AI

Jacob Keiter | BSAN 765 - AI for Business | Spring 2026 | May 15, 2026

Foreword

This essay is structured as a six-article blog series covering what I actually learned in BSAN 765 over the Spring 2026 semester. Each article stands alone and is intended to be readable by a peer who has not taken the course. They share an arc: from "what is inside the model" through "what do we wrap around the model" to "what did I learn when we shipped one." The applied piece in Article 6 draws on Group 6's parking-occupancy project, which used Google's Gemma 4 31B multimodal model running locally via LM Studio on an RTX 5090 against the Kaggle PKLot dataset.

A note on style: I have followed the project writing rules (no em or en dashes, bold reserved for headings and defined terms), which means the prose looks a little different from the average Medium post. I think it reads better that way; you can decide.

Article 1: From transformers to tokens: what actually happens inside a chatbot

When I started BSAN 765 in January, I could use ChatGPT and Claude every day without having any real idea what was happening when I hit "send." Five months later, I cannot un-see the pipeline. This first post is the version of that pipeline I wish someone had walked me through on day one.

The model does not see words

The first surprise was that a large language model never sees the text you type. It sees tokens, which are small chunks of text mapped to integers. OpenAI's GPT-4o uses a tokenizer called o200k_base (about 200,000 distinct tokens); GPT-3.5 uses cl100k_base (about 100,000). Both are Byte-Pair Encoding (BPE) variants. LLaMA uses a different family called SentencePiece, so the same sentence splits differently. The tiktokenizer playground at tiktokenizer.vercel.app makes this concrete: paste any sentence, switch models, and watch the boundaries move.

This sounds like trivia until you hit one of its consequences. The reason large language models famously cannot count the r's in "strawberry" is that "strawberry" is often a single token. The model does not see s-t-r-a-w-b-e-r-r-y; it sees one integer, with no character-level structure available to count. Tokenization is also why your API bill scales the way it does. Every input character, every output character, every system prompt, every retrieved chunk: all of it gets turned into tokens, and you pay per token.

Attention is what made this whole thing possible

Once the input is a sequence of integers, the model runs them through a transformer. The transformer's defining trick is self-attention, which lets each token's representation incorporate information from every other token in the context window. That is what lets a model resolve "it" two paragraphs later, or follow a long instruction. Earlier sequence models (RNNs, LSTMs) processed text strictly left to right, which made long-range dependencies expensive. The transformer trades that cost for memory, which turned out to be a much better trade.

The original transformer paper from 2017 is one of the few research papers I have read this semester that genuinely earned the "this changed everything" reputation. Almost every model anyone is talking about right now (GPT, Claude, Gemini, LLaMA, Gemma, Qwen, DeepSeek) is a transformer underneath.

Parameters are a misleading scoreboard

The other thing I picked up early is that "parameter count" stopped being a useful capability metric somewhere around GPT-4. The series goes 117 million (the original GPT in 2018), 1.5 billion (GPT-2), 175 billion (GPT-3), an estimated 1.8 trillion (GPT-4, reportedly mixture-of-experts), and after that nobody publishes the numbers anymore. Architecture (dense versus mixture-of-experts), training compute, data quality, and post-training (instruction tuning, RLHF, reasoning) now matter more than the raw count. A mixture-of-experts model with 1 trillion total parameters might only activate 50 billion per token. A heavily post-trained 70B dense model can beat a less-trained 200B model on real tasks. The leaderboards have learned this; press releases have not.

What a base model actually is

Before instruction tuning, a base language model just predicts the next token. If you prompt a base GPT-3 with "The capital of France is", it completes "Paris." If you prompt it with "Write me a haiku about parking lots", it might respond with three more "Write me a..." lines because that pattern is more common in the training data. The thing that turns a base model into an assistant is post-training: supervised fine-tuning on instruction-following examples, then reinforcement learning from human feedback to align responses with what people actually want. The base model is the engine. The post-training is the seat belt, the steering wheel, and the rules of the road.

Holding all of that in your head is what makes the rest of the course (prompt engineering, RAG, agents, fine-tuning) feel less like a stack of techniques and more like a coherent toolkit. Everything we covered after Week 4 is a different way to nudge a token-predicting engine toward the answer you actually want.

Article 2: Prompt engineering is software engineering for English

For the first month of this course, I treated prompts the way most people do: type a question, see what comes back, edit if it is wrong. By March I had stopped doing that. Good prompts are not written, they are engineered, and the difference matters once you are building anything more serious than a one-off chat.

The cheapest lever, and where it ends

Prompt engineering is the cheapest of the three big levers we covered (prompting, retrieval, agentic systems). No new infrastructure, no embedding pipeline, no fine-tuning job; just words. That makes it the right first move on almost any problem. It is also the lever that runs out of road the fastest. Reaching for a stricter system prompt is the right answer until it visibly stops paying off, and the moment to notice that ceiling is when your tenth round of prompt edits is producing the same failure mode as the first nine.

Techniques that earned their keep

A few patterns from the prompt engineering lectures actually changed how I write prompts now.

Few-shot examples. Showing the model three or four worked examples before asking it the real question is dramatically more reliable than describing what you want in prose. The model is a pattern-matcher; give it patterns.

Chain-of-thought. Asking the model to "think step by step" before giving its final answer reliably improves multi-step reasoning. Newer "reasoning" models (GPT-5, o-series, Claude with extended thinking, DeepSeek R1) do this internally, so you rarely have to prompt for it explicitly; older or smaller models still benefit.

Role priming. Telling the model "you are a careful JSON formatter that returns only a single JSON object, no preamble" works better than "please respond in JSON." Roles compress a lot of style constraints into a short prefix.

Structured output. Constraining the model to a fixed JSON schema (with concrete fields, types, and an example) makes the output programmatically parseable, which is the difference between a demo and a system. Rajapriya Sambandam, the founder of Peppermint AI who spoke to our class in April, made this point with a side-by-side: free-text wall of words on one side, the same answer rendered into a clean UI on the other. The format is half the product.

Self-critique loops. Asking the model to "give me three questions that would clarify my answer" before answering, the technique John Ware described from his architecture-review work at Archware, surfaces ambiguity that a single-shot prompt would have papered over. Cross-model verification (same prompt, different model, see what changes) is a related habit that is genuinely useful when stakes are high.

What I caught myself doing wrong

Two anti-patterns I had to unlearn. First, I kept stuffing the system prompt with edge-case rules whenever I saw a new failure mode, which slowly turned my prompt into a 2,000-token wall of "and if the user asks X, also do Y." A long system prompt is its own failure mode: the model starts forgetting the rules in the middle of the document, or worse, mis-following them. The fix is to put the rules where they belong: stable instructions in the system prompt, dynamic context in retrieval, and structured constraints in the response schema, not all three in one bloated string.

Second, I trusted my own judgment about whether a prompt "worked." Rajapriya's phrasing reframed this for me: "your prompt can serve 10 customers, but can it serve 100,000?" One person eyeballing two outputs is not evaluation, it is anecdote. Peppermint AI runs about 60% deterministic checks and 40% LLM-as-judge evals on its own production traffic. The boring conclusion I keep coming back to is that prompt engineering without an evaluation harness is hope, not engineering.

Article 3: Vector databases and RAG: why your LLM needs a memory

The single most useful thing I learned in BSAN 765 is the pattern called retrieval-augmented generation (RAG). It is also the lever that, in my opinion, is doing the most actual work in production AI today. This post is the version I would have wanted before Week 8.

The problem RAG solves

A large language model knows what was in its training data, frozen at the cutoff. It does not know what your company published last week, what your customer's order number is, or what the building code in Sedgwick County says about ceiling heights. It also will not admit when it does not know; it will hallucinate plausibly. You can fix this two ways: bake the knowledge into the weights (fine-tuning, expensive, slow to update) or look the knowledge up at query time and put it in the prompt (RAG, cheap, updates instantly when your corpus changes). Almost every "AI chatbot over our internal docs" product you have used is the second one.

What a vector database actually is

A vector database stores embeddings. An embedding is a high-dimensional numeric representation of a piece of content (a text chunk, an image, audio) produced by a model trained to put semantically similar things near each other in vector space. "Laptop" and "computer" land close together; "laptop" and "carrot" do not. To answer a question, you embed the question with the same model, ask the database for the top-k nearest chunks by cosine similarity (or inner product), and put those chunks in the LLM's prompt. The LLM composes an answer grounded in retrieved text.

For Assignment 3 I built one of these against the group project's PKLot image corpus. I sampled 500 timestamped JPEGs stratified across 100 dates, embedded them with CLIP ViT-B/32 (a 512-dimensional vision-language model), stored the vectors in Milvus Lite, and queried with text prompts. "A mostly empty parking lot in the morning" returned five lots with zero occupied spaces. "A crowded parking lot on a sunny afternoon" returned five lots that were 39 or 40 out of 40 full. The retriever was reading occupancy and lighting straight out of the pixels, not the filenames. That was the moment vector search stopped feeling like magic and started feeling like a tool.

Vendor choices

The class touched a half-dozen vector stores and I have now used four of them in different settings. Milvus Lite is a single-file embedded store, perfect for laptops and demos, with the catch that it does not publish Windows wheels (I ran it under WSL Ubuntu). Pinecone is the hosted heavyweight; our group project picked it with ViT-B/16 for the embedding stack on the production-ish path. FAISS is Facebook's library, fast and library-only (no server), which is what most LangChain quickstarts default to. Chroma is the friendly developer-first one. The honest answer is that the choice rarely matters until your corpus crosses about a million chunks; below that, almost anything will work, and the quality of your chunking matters more than the engine.

The three RAG case studies that stuck with me

For Assignment 3 I dug into three production RAG deployments. Morgan Stanley Wealth Management put tens of thousands of analyst reports behind a chat assistant, and grew the answerable corpus from about 7,000 to about 100,000 documents without re-training the LLM. That last detail is the whole thesis of RAG: retrieval scales faster than fine-tuning ever could. Harvey AI is the legal version, where citation traceability is the whole product because hallucinated case law is malpractice. Klarna deployed an RAG-plus-tools customer-service assistant that, by its own reporting, handled two-thirds of chats in its first month, then quietly walked back the headline and re-hired some agents when quality drifted on edge cases. RAG is a force multiplier, not a replacement.

Ashwini Vasanth, Nutanix, and what enterprise RAG actually looks like

Ashwini Vasanth, a product manager at Nutanix who spoke to our class in March, gave me the most realistic picture of enterprise RAG that I have seen. Her observation that the threat landscape scales with system complexity (hallucinations alone at the chatbot stage, then unauthorized tool access at the agent stage, then cascading failures at the multi-agent stage) reframed how I think about deployment risk. She also recommended running smaller open-source models locally through Ollama or LM Studio for any task that does not need a frontier model, especially when the corpus contains anything sensitive. Our group project ended up doing exactly that: Gemma 4 31B running locally on an RTX 5090 via LM Studio for the vision pipeline, with no images leaving the box.

The thing nobody tells you about RAG

Retrieval is sharp when the chunking is good and the embedding model matches the modality. Retrieval is useless when the chunks are too big (the LLM dilutes them), too small (no context), or embedded with the wrong model. Image-search needs a vision-language embedding (CLIP, ViT). Code needs code-aware splitting. Legal text needs clause-aware splitting. The reason "just throw it in Pinecone" demos look great on slides and fall over in production is that the slide deck never shows you the chunking config. That config is the work.

Article 4: LangChain, LangGraph, and the orchestration problem

Around Week 14 the course shifted from "what is an LLM" to "how do you wire one into a real workflow." The two frameworks that took up most of that conversation were LangChain and LangGraph. I came out of Assignment 4 with a clear opinion about both, and an even clearer one about LlamaIndex, which is the third name in the same neighborhood.

What each framework is actually for

LangChain at its best is a way to express a pipeline. The headline pattern is LCEL, the LangChain Expression Language, which uses Python's pipe operator to compose a chain like prompt | model | output_parser. Reads left to right, runs streaming and async out of the box, and gives you LangSmith tracing for free. It is the right abstraction when your workflow is a directed acyclic pipeline: input goes in, a fixed sequence of components transforms it, output comes out.

LangGraph is the sibling for everything that does not fit that shape. It models the application as a state machine: nodes connected by edges, a shared State dictionary that every node reads from and writes to, and conditional edges that act like if/then routes. Cycles are first-class. An agent can call a tool, observe the result, and decide whether to call another tool or finish. Human-in-the-loop is supported through native breakpoints, and the State object can be inspected, rewound, and replayed.

LlamaIndex is the third in the group and is built specifically around the RAG case. Its two primitives, SimpleDirectoryReader plus VectorStoreIndex.from_documents, get you from a folder of mixed-format files to a working query engine in about ten lines. The framework used to be called GPTIndex, and it still feels opinionated about indexing.

How I now choose

After running all three in Assignment 4, my rough rule is this. Reach for LlamaIndex when I want a working RAG over a folder of documents in an afternoon and I do not yet know what tradeoffs matter to me. The LlamaHub connectors (Notion, Google Drive, GitHub, SQL, web pages) save real work. Reach for LangChain when the workflow is a fixed pipeline that ends. The textbook example I gave in Assignment 4 is a nightly batch job that summarizes each new customer support ticket and writes the summary plus a category tag to a database: load, prompt, parse, write. Four LCEL components, runs async over thousands of tickets, falls back to a secondary model if the primary one rate-limits. LangGraph here would be all overhead.

Reach for LangGraph when the workflow loops, branches, or pauses. Group 6's operator chat assistant from the parking project is the case where I would actually use it. The operator asks "how full is Allen Field House lot right now, and is it likely to fill before tipoff?" To answer, the agent has to decide what it needs (live YOLO occupancy snapshot, historical time series, campus event schedule, or some mix), maybe loop back for more if the first round of retrieval is thin, and probably pause for confirmation before pushing a digital-sign update. That is a state machine: cycles, conditional routing, human-in-the-loop. LangChain could handle it, but only with a lot of glue code that re-implements what LangGraph gives you free.

What I would push back on

Frameworks are not free. Every LangChain abstraction is also a thing you have to learn the conventions for, and the framework moves fast enough that half the Medium posts you find on Google are referencing an API that was deprecated six months ago. For really simple cases, I have started writing the four-line chain.invoke() myself with the raw provider SDK, because the framework's "convenience" stops being convenient when I am debugging a Runnable nesting bug at 11 p.m. The honest take is that LangChain is the right tool when a team is committing to a long-lived system that will benefit from streaming, async, fallbacks, and tracing. For a one-shot script or a demo, plain Python and the provider SDK is faster.

Conway's Law for AI

Rajapriya Sambandam quoted Conway's Law in her session and it has stuck with me. The structure of your AI system mirrors the communication structure of the team that built it. If your team is one person, you will probably end up with a single Python notebook calling one API. If your team has product, ML, and platform people, you will probably end up with a frontend service, an orchestration layer, and a model server. The framework choice is downstream of the team shape. That is uncomfortable to hear from inside a class group project, but it is the kind of thing that explains why some teams' systems feel coherent and others' feel like a Frankenstein.

Article 5: Fine-tuning, QLoRA, and the build-vs-buy question

Fine-tuning was the lecture I went into most skeptical and came out of most convinced. Going in, my prior was that prompt engineering plus retrieval would handle almost every case, and that fine-tuning was a luxury for teams with too much GPU. Coming out, my prior is that fine-tuning is the right tool for a narrow but real class of problems, and that QLoRA in particular has made it accessible on consumer hardware.

The right way to frame it

Karthik's framing from the fine-tuning lecture is the one that finally clicked: prompting and retrieval "improve how we use the model"; fine-tuning "improves the model" itself. Those are different problems. If the model knows the right answer and you cannot get it out, that is a prompt or retrieval problem. If the model does not know the answer, or has a habit you cannot prompt your way out of, that is a fine-tune problem.

For Assignment 4 I ran a QLoRA fine-tune on Qwen2.5-0.5B-Instruct to teach it a single fictional fact ("Acme Analytics earned $12.8 million in Q4 2025"). Before the tune, the base model correctly refused to answer because it had no way to know. After 15 epochs of training on four examples (2,162,688 trainable parameters out of about 317 million, or 0.68%), training loss dropped from 3.77 to 1.38, mean-token accuracy rose from 0.50 to 0.74, and the post-tune model answered the question verbatim. The whole training run took 2.6 seconds on the RTX 5090. That is when fine-tuning stopped being a textbook concept for me.

What QLoRA actually is

QLoRA stands for Quantized Low-Rank Adaptation. Two ideas stacked. Quantization compresses the base model's weights from 16-bit to 4-bit, which cuts VRAM usage by roughly 4x. Low-rank adaptation freezes those weights entirely and instead trains a tiny pair of low-rank matrices that get added to specific layers. The result is that I can fine-tune a 7B or even 31B model on a single consumer GPU, and the resulting adapter is tens of megabytes on disk, not gigabytes.

Two things follow from that. First, fine-tuning is now cheap enough that you can fine-tune for things you would not have bothered with before: style, schema compliance, domain-specific terminology. Second, you can keep many adapters around (one per customer, one per task, one per weather condition) and swap them at inference time without ever loading multiple copies of the base model.

The exact case I would fine-tune for

For Group 6's PKLot project I drafted (but did not yet train) an adapter for Gemma 4 31B to fix three repeatable failure modes. JSON preamble drift (the model occasionally wraps its answer in conversational text). Cloudy-weather under-counting (lighter shadows read as "uncertain"). Schema drift (extra keys or swapped occupied/empty lists). None of these are knowledge gaps, so retrieval would not help. They are output-style and visual-classification habits that need to be drilled into the weights. The training set would be a few thousand (image, target JSON) pairs sampled from PKLot, stratified by lot and weather, with the cloudy and rainy samples oversampled. Success looks like two numbers: per-space F1 lifted by five points on cloudy frames without regressing on sunny ones, and JSON parse success rate moving from about 92% to 99% or better.

When fine-tuning is the wrong answer

If your problem is that the model needs to know your customer's order number, fine-tuning is the wrong tool. The customer order number changes every day; the model's weights change in a multi-hour training run. That is retrieval's job. If your problem is "make the model less wrong about a thing you can describe in words", prompt engineering is the wrong tool only after you have tried it. And if your problem is "use this proprietary API five tools deep", that is agentic plus tool-use territory, not fine-tuning.

The build-vs-buy question I keep coming back to is the one John Ware framed in his Archware session. He was clear that for his architecture-review use case, off-the-shelf frontier models plus careful prompting and vector retrieval would get him most of the value, and the cost of running a fine-tune was hard to justify on the relatively small number of projects he reviews per year. That is the right answer for his domain. For a high-volume, narrow, repeatable task with a clear failure mode, fine-tuning is the right answer. The decision is rarely about whether fine-tuning is possible. It is about whether the volume justifies the iteration loop slowing from minutes (prompts) to hours (training).

Article 6: Lessons from shipping a multimodal AI: PKLot parking detection with Gemma

The most useful thing about BSAN 765 was that the group project forced me to actually ship a working multimodal AI system, not just talk about one. Group 6 spent the second half of the semester turning Google's open weights Gemma 4 31B model into a self-calibrating parking-occupancy detector for the Kaggle PKLot dataset (roughly 12,400 overhead images across three South American campuses). What follows are the things I would tell my January self if I could go back and do it again.

The architecture that finally worked

The pipeline that survived contact with real images has four flows. Flow A0 runs once per camera: Gemma looks at an empty-lot reference frame and emits a JSON "lot map" describing every painted parking rectangle. That map becomes the runtime anchor, replacing hand-annotated XML for any new lot we onboard. Flow A runs every two to five minutes per camera: Gemma classifies each anchored space as occupied, empty, or uncertain on the latest frame, and the result is cached. Flow B is read-only and feeds three customer-facing surfaces (a mobile heat-map, a digital lot-entrance sign, and an operator dashboard). Flow C is the only flow with a free-text prompt, where an operator can ask "is the south lot likely to fill before tipoff?" and get a grounded answer.

The reason this layout matters is that the expensive thing (running a 31B parameter vision-language model) only happens on a small cadence, and the cheap things (rendering a heat-map, answering a chat) read from the cache. Most production AI systems I have seen since this project look like some version of this same split.

Where it actually broke

Three failure modes showed up in our test runs, and none of them were the dramatic ones the AI safety literature warns about. They were boring, and that was the point. First, Gemma sometimes wrapped its JSON answer in conversational preamble ("Sure, here's the lot map: ..."), which broke any downstream parser that expected a single JSON object. Second, on cloudy and rainy PKLot variants, it under-counted occupancy because cars cast lighter shadows that the model read as "uncertain." Third, the schema occasionally drifted: extra keys, missing confidence fields, or swapping the occupied and empty lists.

I tried to fix all three by prompt engineering first, which is exactly what the course taught me to do before reaching for heavier tools. Stricter system prompts, explicit JSON schema, few-shot examples, role priming as a strict parser. Each of these helped a little, then plateaued. That is where Karthik's framing from the fine-tuning lecture finally clicked. Prompt engineering and retrieval "improve how we use the model"; fine-tuning "improves the model" itself. The cloudy-frame failure mode is not a knowledge gap that retrieval could fix, it is a visual-classification habit the weights need to learn.

Why QLoRA is the right tool for this exact problem

A full fine-tune of Gemma 4 31B is out of reach on a single RTX 5090. QLoRA solves three problems at once. It trains a small low-rank adapter on top of a 4-bit quantized base model, which fits in consumer VRAM. The adapter is tens of megabytes on disk, so I can keep a separate variant for each lot (PUCPR, UFPR04, UFPR05) and weather class (sunny, cloudy, rainy) without storing multiple 60 GB model copies. And because the pretrained weights are frozen, the base model's general vision-language capability stays intact, which sidesteps the catastrophic-forgetting risk we covered in lecture.

My target training set is a few thousand (image, target_JSON) pairs sampled from the existing PKLot XML annotations, stratified by lot and weather, with the cloudy and rainy samples oversampled. The success metric is two numbers: per-space F1 on a held-out test set, and JSON-parse success rate. The current baseline is around 92% parse success; the target is 99% or better. If the adapter lifts cloudy/rainy F1 by five points without regressing sunny weather, the fine-tune was worth doing. If it does not, prompting was the right ceiling.

What I would tell my January self

Ship the boring version first. Our first pipeline used hand-annotated XML at runtime because it was already in the dataset. It worked, and it let us postpone the "have Gemma calibrate itself" feature until after the rest of the system was real. Cache aggressively, because the model call is always the long pole. Treat every JSON output as untrusted user input, even when the user is your own model. And do not reach for fine-tuning until prompt engineering has visibly stopped paying off, because the moment you start training, the iteration loop slows from minutes to hours.

The single biggest thing I learned from this project is that the interesting decisions in applied AI are not about which model to use. They are about which parts of the pipeline you let the model touch, where you cache, and what you do when the model is wrong. The model is the easy part. Everything around it is the work.

Closing notes

What I am taking with me from this course

Three things I will keep using long after the semester ends.

First, the four-lever framework. Pre-trained model plus prompting, retrieval, agents, or fine-tuning. Picking the right lever for the problem is most of the work. Skipping that diagnostic step is how teams end up with a fine-tune for what was really a prompt engineering problem, or a multi-agent system for what was really a four-line script.

First-principles evaluation. Rajapriya Sambandam's question ("your prompt can serve 10 customers, but can it serve 100,000?") changed how I think about every AI system I work on now. Eyeballing two outputs is not evaluation. Even rough deterministic checks plus a small LLM-as-judge eval beat no harness at all.

The human-in-the-loop pattern. John Ware's framing of AI as "an accelerator, not a replacement" is the right operating posture for every business use case I have looked at this semester. The places where AI is doing the most value-creating work right now (Morgan Stanley's research assistant, Harvey for legal drafting, Klarna's customer support) are all built on this pattern. The places where it has gone badly are the ones that tried to remove the human entirely.

Acknowledgements

Thanks to Professor Karthik for the course design, the four-lever taxonomy, and the willingness to let students run real GPU workloads instead of canned Colab demos. Thanks to the guest speakers (John Ware of Archware, Sameer Patil, Ashwini Vasanth of Nutanix, Rajapriya Sambandam of Peppermint AI) for showing what production AI work actually looks like outside a classroom. And thanks to Group 6 for being patient with the model when it returned conversational preamble instead of JSON for the fourth time in a row.

GenAI tool disclosure

Per the assignment guidelines: Claude Code (Anthropic) was used to format and structure this document. The writing itself is mine.

Gerald-Jinx-Mouse/keiter_jacob_individual_essay_report.md