{
"cells": [
{
"cell_type": "markdown",
"id": "llama_intro_class_01",
"metadata": {},
"source": [
"# 🛠️ Workshop: Classifying QA Bug Tickets with Llama Models\n",
"\n",
"## Goals (1 Hour):\n",
"- Understand how to use Llama models for text classification tasks.\n",
"- Define relevant categories for QA bug tickets.\n",
"- Craft effective prompts to instruct Llama to classify tickets accurately.\n",
"- Discuss strategies for improving classification robustness and handling ambiguous cases.\n",
"- Briefly explore parameter tuning (temperature) for Llama model's classification output."
]
},
{
"cell_type": "markdown",
"id": "llama_setup_class_01",
"metadata": {},
"source": [
"## Setting the Stage: Automating Bug Ticket Triage with Llama\n",
"\n",
"Manually triaging and categorizing bug tickets filed by QA can be time-consuming. Large language models like Llama can assist by automatically classifying these tickets based on their descriptions. This workshop explores how to set up a simple Llama-powered classification pipeline.\n",
"\n",
"## What You'll See:\n",
"- **Llama API Interaction**: Setting up calls to a Llama API endpoint.\n",
"- **Defining Categories**: Establishing a clear set of bug ticket categories.\n",
"- **Prompt Engineering for Classification**: Instructing Llama to assign a category.\n",
"- **Output Parsing**: Extracting the predicted category from Llama's text response.\n",
"- **Considering Edge Cases**: How to improve accuracy and handle unclear classifications."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_api_keys_class_01",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import requests # For making HTTP requests to the Llama API\n",
"import re # For parsing the classification output\n",
"from typing import List, Optional, Dict, Any\n",
"from IPython.display import Markdown, display\n",
"\n",
"# IMPORTANT: Configure your Llama API endpoint\n",
"LLAMA_API_ENDPOINT = os.environ.get('LLAMA_API_ENDPOINT', 'http://localhost:8080/completion')\n",
"\n",
"def printmd(string):\n",
" display(Markdown(string))\n",
"\n",
"printmd(f\"Using Llama API Endpoint: `{LLAMA_API_ENDPOINT}`\")"
]
},
{
"cell_type": "markdown",
"id": "llama_api_helper_class_01",
"metadata": {},
"source": [
"### Helper Function for Llama API Calls\n",
"(This helper function `call_llama_api` can remain largely the same as in the previous Llama example, as it's a generic API caller. Ensure the prompt templating within it is appropriate for your Llama model version, e.g., Llama 3 Instruct.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_api_helper_code_class_01",
"metadata": {},
"outputs": [],
"source": [
"def call_llama_api(system_prompt: str, user_prompt: str, temperature: float = 0.1, max_tokens: int = 100) -> Optional[str]: # Reduced max_tokens for classification\n",
" \"\"\"Calls a Llama API (e.g., llama.cpp server's /completion endpoint).\"\"\"\n",
" full_prompt = f\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\\n{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\"\n",
" \n",
" payload = {\n",
" \"prompt\": full_prompt,\n",
" \"temperature\": temperature,\n",
" \"n_predict\": max_tokens, \n",
" \"stop\": [\"<|eot_id|>\", \"<|end_header_id|>\", \"\\n\"], # Stop on newline for simple classification output\n",
" }\n",
"\n",
" try:\n",
" response = requests.post(LLAMA_API_ENDPOINT, json=payload, timeout=60)\n",
" response.raise_for_status()\n",
" response_data = response.json()\n",
" generated_text = response_data.get('content', '').strip()\n",
" # printmd(f\"**LLM Raw Response:**\\n```\\n{generated_text}\\n```\") # For debugging\n",
" return generated_text\n",
" except requests.exceptions.RequestException as e:\n",
" printmd(f\"**API Call Error:** `{e}`\")\n",
" return None\n",
" except json.JSONDecodeError as e:\n",
" printmd(f\"**API Response JSON Decode Error:** `{e}`. Response text: `{response.text}`\")\n",
" return None"
]
},
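{
"cell_type": "markdown",
"id": "llama_api_helper_smoke_md_01",
"metadata": {},
"source": [
"A quick sanity check of the helper (this assumes a llama.cpp-style server is already running at `LLAMA_API_ENDPOINT`; skip this cell if it is not):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_api_helper_smoke_code_01",
"metadata": {},
"outputs": [],
"source": [
"# Minimal smoke test for call_llama_api; harmless to skip if no server is running.\n",
"smoke_reply = call_llama_api(\n",
"    system_prompt=\"You are a terse assistant.\",\n",
"    user_prompt=\"Reply with the single word: ready\",\n",
"    temperature=0.0,\n",
"    max_tokens=5,\n",
")\n",
"printmd(f\"Smoke test reply: `{smoke_reply}`\")"
]
},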
{
"cell_type": "markdown",
"id": "llama_classification_def_01",
"metadata": {},
"source": [
"## 1. Defining Bug Ticket Categories\n",
"\n",
"For QA bug tickets, common categories might include:\n",
"- **UI/UX**: Issues related to user interface elements, layout, or user experience flows.\n",
"- **Functionality**: Core features of the application not working as expected.\n",
"- **Performance**: Problems related to speed, responsiveness, or resource usage.\n",
"- **Security**: Vulnerabilities or issues related to data protection and access control.\n",
"- **Data**: Issues with data integrity, storage, or retrieval.\n",
"- **Crash/Error**: Application crashes, freezes, or unhandled exceptions.\n",
"- **Documentation**: Errors or omissions in user guides, API docs, or help text.\n",
"- **Other**: For issues not fitting neatly into other categories.\n",
"\n",
"It's crucial these categories are well-defined and mutually exclusive as much as possible for clear classification."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_categories_code_01",
"metadata": {},
"outputs": [],
"source": [
"BUG_CATEGORIES = [\n",
" \"UI/UX\",\n",
" \"Functionality\",\n",
" \"Performance\",\n",
" \"Security\",\n",
" \"Data\",\n",
" \"Crash/Error\",\n",
" \"Documentation\",\n",
" \"Other\"\n",
"]\n",
"\n",
"printmd(\"**Defined Bug Categories:**\")\n",
"for category in BUG_CATEGORIES:\n",
" printmd(f\"- {category}\")"
]
},
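{
"cell_type": "markdown",
"id": "llama_categories_descriptions_md_01",
"metadata": {},
"source": [
"### Optional: Category Descriptions for the Prompt\n",
"\n",
"Short, explicit definitions of each category can be injected into the prompt to reduce ambiguity. The dictionary below is an illustrative sketch (the names `CATEGORY_DESCRIPTIONS` and `build_category_block` are not used elsewhere in this notebook); adapt the wording to your own triage guidelines."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_categories_descriptions_code_01",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: one-line definitions that can be appended to the system prompt.\n",
"CATEGORY_DESCRIPTIONS = {\n",
"    \"UI/UX\": \"Layout, styling, or usability problems; the feature itself still works.\",\n",
"    \"Functionality\": \"A feature does not behave as specified.\",\n",
"    \"Performance\": \"Slowness, high resource usage, or responsiveness issues.\",\n",
"    \"Security\": \"Vulnerabilities, injection risks, or access-control problems.\",\n",
"    \"Data\": \"Incorrect, missing, or corrupted data.\",\n",
"    \"Crash/Error\": \"Crashes, freezes, or unhandled exceptions.\",\n",
"    \"Documentation\": \"Wrong or outdated docs, help text, or API references.\",\n",
"    \"Other\": \"Anything that does not fit the categories above.\",\n",
"}\n",
"\n",
"def build_category_block(categories: List[str]) -> str:\n",
"    \"\"\"Builds a numbered category list with short definitions for use in a prompt.\"\"\"\n",
"    lines = []\n",
"    for i, cat in enumerate(categories, start=1):\n",
"        desc = CATEGORY_DESCRIPTIONS.get(cat, \"\")\n",
"        lines.append(f\"{i}. {cat}: {desc}\" if desc else f\"{i}. {cat}\")\n",
"    return \"\\n\".join(lines)\n",
"\n",
"printmd(f\"```\\n{build_category_block(BUG_CATEGORIES)}\\n```\")"
]
},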
{
"cell_type": "markdown",
"id": "llama_classification_prompt_01",
"metadata": {},
"source": [
"## 2. Prompt Engineering for Ticket Classification\n",
"\n",
"We need to instruct Llama to choose **one** category from our predefined list. The prompt should clearly state the task, provide the list of categories, and present the bug ticket description.\n",
"\n",
"**Strategy for reliable output:** Ask the model to output *only* the category name."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_classification_code_01",
"metadata": {},
"outputs": [],
"source": [
"def classify_bug_ticket(ticket_description: str, categories: List[str]) -> Optional[str]:\n",
" printmd(f\"**Classifying Ticket:**\\n```\\n{ticket_description}\\n```\")\n",
"\n",
" # Create a numbered list of categories for the prompt\n",
" category_list_str = \"\\n\".join([f\"{i+1}. {cat}\" for i, cat in enumerate(categories)])\n",
"\n",
" system_prompt_text = f\"\"\"\n",
"You are an expert QA Bug Triage system. Your task is to classify the given bug ticket description into ONE of the following categories.\n",
"Respond with ONLY the category name from the list. Do not add any other text, explanations, or punctuation.\n",
"\n",
"Available Categories:\n",
"{category_list_str}\n",
"\"\"\"\n",
" # Few-shot examples can significantly improve performance for classification\n",
" # Adding one simple example here. More examples, especially for tricky cases, are better.\n",
" few_shot_example_user = \"Bug Ticket: The login button on the main page is not responding when clicked.\\n---\\nClassify this ticket using one of the categories above.\"\n",
" few_shot_example_assistant = \"Functionality\"\n",
" \n",
" # We can try to build a more robust prompt that includes the few-shot example\n",
" # For the call_llama_api, the full prompt is constructed with system and user parts.\n",
" # We can put the few-shot example in the user prompt or interleave it if the model supports chat format well.\n",
" # Simpler approach: put examples in the system prompt, or as part of a longer user prompt if not using explicit chat turns.\n",
"\n",
" # Let's try a simpler user prompt for the first attempt and then consider few-shot.\n",
" user_prompt_text = f\"Bug Ticket Description:\\n```\\n{ticket_description}\\n```\\n---\\nBased on the description, what is the single most appropriate category from the provided list? Respond with only the category name.\"\n",
"\n",
" # If using a model that handles chat history well, you could format few-shot like this:\n",
" # messages = [\n",
" # {\"role\": \"system\", \"content\": system_prompt_text},\n",
" # {\"role\": \"user\", \"content\": few_shot_example_user},\n",
" # {\"role\": \"assistant\", \"content\": few_shot_example_assistant},\n",
" # {\"role\": \"user\", \"content\": user_prompt_text}\n",
" # ]\n",
" # Then call_llama_api would need to be adapted to handle a list of messages.\n",
" # For now, keeping call_llama_api with single system/user prompt strings.\n",
"\n",
" llama_response_text = call_llama_api(system_prompt_text, user_prompt_text, temperature=0.05, max_tokens=20) # Low temp, few tokens\n",
"\n",
" if not llama_response_text:\n",
" printmd(\"**LLM did not return a response.**\")\n",
" return None\n",
"\n",
" # Clean up the response: sometimes models add extra newlines or spaces\n",
" predicted_category_raw = llama_response_text.strip()\n",
" printmd(f\"**LLM Raw Prediction:** `{predicted_category_raw}`\")\n",
"\n",
" # Validate if the predicted category is one of the known categories\n",
" # Simple exact match for now. Could use fuzzy matching for robustness.\n",
" best_match = None\n",
" for cat in categories:\n",
" if cat.lower() == predicted_category_raw.lower():\n",
" best_match = cat\n",
" break\n",
" # A more robust check could involve looking for the category string within the response\n",
" # if cat.lower() in predicted_category_raw.lower():\n",
" # best_match = cat\n",
" # break \n",
"\n",
" if best_match:\n",
" printmd(f\"**Validated Category:** `{best_match}`\")\n",
" return best_match\n",
" else:\n",
" printmd(f\"**Warning: LLM prediction (`{predicted_category_raw}`) is not an exact match to a defined category. It might be a hallucination or slight variation. Consider adding it or refining the prompt.**\")\n",
" # Option: return the raw prediction if you want to handle variations downstream\n",
" # return predicted_category_raw \n",
" return None # Or return a default like 'Other'\n",
"\n",
"# Example Bug Tickets\n",
"ticket1 = \"User login fails with 'Incorrect Password' error even when correct credentials are used. Tested on Chrome and Firefox. Backend logs show no auth attempt.\"\n",
"ticket2 = \"The checkout button is misaligned on mobile screens (iPhone 12, Safari). It overlaps with the order summary text, making it hard to click.\"\n",
"ticket3 = \"Application becomes unresponsive and eventually crashes when uploading a file larger than 500MB. No error message is displayed before the crash.\"\n",
"ticket4 = \"The API endpoint /api/users/{id} returns a 500 internal server error if the user ID contains special characters like '!'. This could be an injection risk.\"\n",
"ticket5 = \"The help text for the 'Advanced Search' feature is outdated and refers to UI elements that no longer exist after the recent redesign.\"\n",
"\n",
"classified_ticket1 = classify_bug_ticket(ticket1, BUG_CATEGORIES)\n",
"print(\"-\"*50)\n",
"classified_ticket2 = classify_bug_ticket(ticket2, BUG_CATEGORIES)\n",
"print(\"-\"*50)\n",
"classified_ticket3 = classify_bug_ticket(ticket3, BUG_CATEGORIES)\n",
"print(\"-\"*50)\n",
"classified_ticket4 = classify_bug_ticket(ticket4, BUG_CATEGORIES)\n",
"print(\"-\"*50)\n",
"classified_ticket5 = classify_bug_ticket(ticket5, BUG_CATEGORIES)"
]
},
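{
"cell_type": "markdown",
"id": "llama_fewshot_chat_md_01",
"metadata": {},
"source": [
"### Optional: Few-Shot Classification via a Chat Endpoint\n",
"\n",
"The commented-out `messages` list above hints at a chat-style few-shot prompt. The sketch below shows one way to send it, assuming your server also exposes an OpenAI-compatible `/v1/chat/completions` endpoint (recent llama.cpp server builds do). The endpoint URL, the `call_llama_chat_api` helper, and the example messages are illustrative, not part of the notebook's main pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_fewshot_chat_code_01",
"metadata": {},
"outputs": [],
"source": [
"# Assumption: the server exposes an OpenAI-compatible chat endpoint; adjust the URL if it does not.\n",
"LLAMA_CHAT_ENDPOINT = os.environ.get('LLAMA_CHAT_ENDPOINT', 'http://localhost:8080/v1/chat/completions')\n",
"\n",
"def call_llama_chat_api(messages: List[Dict[str, str]], temperature: float = 0.05, max_tokens: int = 20) -> Optional[str]:\n",
"    \"\"\"Sends a list of chat messages (system/user/assistant turns) and returns the reply text.\"\"\"\n",
"    payload = {\"messages\": messages, \"temperature\": temperature, \"max_tokens\": max_tokens}\n",
"    try:\n",
"        response = requests.post(LLAMA_CHAT_ENDPOINT, json=payload, timeout=60)\n",
"        response.raise_for_status()\n",
"        return response.json()[\"choices\"][0][\"message\"][\"content\"].strip()\n",
"    except (requests.exceptions.RequestException, KeyError, ValueError) as e:\n",
"        printmd(f\"**Chat API Call Error:** `{e}`\")\n",
"        return None\n",
"\n",
"# Few-shot classification: one worked example as a user/assistant pair, then the real ticket.\n",
"chat_category_list = \"\\n\".join([f\"{i+1}. {cat}\" for i, cat in enumerate(BUG_CATEGORIES)])\n",
"few_shot_messages = [\n",
"    {\"role\": \"system\", \"content\": f\"You are an expert QA Bug Triage system. Respond with ONLY one category name from this list:\\n{chat_category_list}\"},\n",
"    {\"role\": \"user\", \"content\": \"Bug Ticket: The login button on the main page is not responding when clicked.\"},\n",
"    {\"role\": \"assistant\", \"content\": \"Functionality\"},\n",
"    {\"role\": \"user\", \"content\": f\"Bug Ticket: {ticket2}\"},\n",
"]\n",
"\n",
"printmd(f\"**Few-shot chat prediction for ticket 2:** `{call_llama_chat_api(few_shot_messages)}`\")"
]
},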
{
"cell_type": "markdown",
"id": "llama_classification_robust_01",
"metadata": {},
"source": [
"### Discussion: Improving Classification Robustness\n",
"- **Clear Category Definitions**: The clearer your categories and their descriptions (even if only for human understanding initially), the better the LLM can be prompted.\n",
"- **Few-Shot Prompting**: Providing 1-3 examples of ticket descriptions and their correct categories *within the prompt* can significantly improve accuracy, especially for ambiguous cases. Llama models respond well to this.\n",
" - Example: `System: ... \n User: Ticket: 'Button X is blue'. Category: UI/UX \n User: Ticket: 'App hangs on load'. Category: Performance \n User: Ticket: '{new_ticket_description}'. Category: ?`\n",
"- **Strict Output Parsing**: If the model doesn't strictly output only the category, use regex or string matching to find a valid category within its response. Our current `classify_bug_ticket` does a basic check.\n",
"- **Handling Uncertainty/\"Other\"**: Decide how to handle cases where the LLM is uncertain or the ticket genuinely doesn't fit. The 'Other' category is a fallback. You could also prompt the LLM to output 'UNCERTAIN' if no category fits well.\n",
"- **Iterative Prompt Refinement**: If you see common misclassifications, adjust your main system prompt or add few-shot examples that specifically address those errors.\n",
"- **Temperature**: Keep temperature low (0.0-0.2) for classification to get the most likely, deterministic category."
]
},
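{
"cell_type": "markdown",
"id": "llama_robust_parse_md_01",
"metadata": {},
"source": [
"The sketch below illustrates the \"Strict Output Parsing\" point: a lenient parser that accepts exact matches, substring matches, and close spellings before falling back to `None`. The function name `parse_predicted_category` and the fuzzy-match cutoff are illustrative choices, not part of the main pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_robust_parse_code_01",
"metadata": {},
"outputs": [],
"source": [
"import difflib\n",
"\n",
"def parse_predicted_category(raw_prediction: str, categories: List[str]) -> Optional[str]:\n",
"    \"\"\"Maps a raw LLM reply onto a known category: exact, then substring, then fuzzy match.\"\"\"\n",
"    cleaned = raw_prediction.strip().strip('.\"\\'')\n",
"    # 1. Exact (case-insensitive) match\n",
"    for cat in categories:\n",
"        if cat.lower() == cleaned.lower():\n",
"            return cat\n",
"    # 2. Category name contained somewhere in the reply (e.g. 'Category: UI/UX')\n",
"    for cat in categories:\n",
"        if cat.lower() in cleaned.lower():\n",
"            return cat\n",
"    # 3. Fuzzy match for minor spelling variations (the cutoff is a tunable guess)\n",
"    close = difflib.get_close_matches(cleaned, categories, n=1, cutoff=0.8)\n",
"    return close[0] if close else None\n",
"\n",
"# Quick checks against typical output variations\n",
"for raw in [\"Functionality\", \"Category: UI/UX\", \"performance.\", \"Functionlity\", \"no idea\"]:\n",
"    printmd(f\"- `{raw}` -> `{parse_predicted_category(raw, BUG_CATEGORIES)}`\")"
]
},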
{
"cell_type": "markdown",
"id": "llama_classification_batch_01",
"metadata": {},
"source": [
"## 3. Batch Classification and Simple Analysis (Optional)\n",
"\n",
"We can loop through multiple tickets and then perform a simple analysis, like counting tickets per category."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_classification_batch_code_01",
"metadata": {},
"outputs": [],
"source": [
"all_tickets_data = [\n",
" {\"id\": \"T001\", \"description\": ticket1, \"predicted_category\": classified_ticket1},\n",
" {\"id\": \"T002\", \"description\": ticket2, \"predicted_category\": classified_ticket2},\n",
" {\"id\": \"T003\", \"description\": ticket3, \"predicted_category\": classified_ticket3},\n",
" {\"id\": \"T004\", \"description\": ticket4, \"predicted_category\": classified_ticket4},\n",
" {\"id\": \"T005\", \"description\": ticket5, \"predicted_category\": classified_ticket5},\n",
" # Add a ticket that might be harder to classify or be 'Other'\n",
" {\"id\": \"T006\", \"description\": \"The loading spinner icon is not aligned with WCAG accessibility contrast guidelines.\", \"predicted_category\": classify_bug_ticket(\"The loading spinner icon is not aligned with WCAG accessibility contrast guidelines.\", BUG_CATEGORIES)}\n",
"]\n",
"\n",
"printmd(\"\\n## Batch Classification Results:\")\n",
"category_counts = {cat: 0 for cat in BUG_CATEGORIES}\n",
"unclassified_count = 0\n",
"\n",
"for ticket_data in all_tickets_data:\n",
" printmd(f\"- Ticket {ticket_data['id']}: {ticket_data['predicted_category'] or 'UNCLASSIFIED'}\")\n",
" if ticket_data['predicted_category'] in category_counts:\n",
" category_counts[ticket_data['predicted_category']] += 1\n",
" elif ticket_data['predicted_category']:\n",
" # Handle case where LLM might have returned a slight variation not caught by simple validation\n",
" printmd(f\" (Note: Predicted category '{ticket_data['predicted_category']}' was not in the initial list but was returned by LLM)\")\n",
" if 'Other' in category_counts: category_counts['Other'] +=1 # Or add to a new 'LLM_Generated_Category'\n",
" else: unclassified_count +=1\n",
" else:\n",
" unclassified_count += 1\n",
"\n",
"printmd(\"\\n## Category Counts:\")\n",
"for category, count in category_counts.items():\n",
" if count > 0:\n",
" printmd(f\"- {category}: {count}\")\n",
"if unclassified_count > 0:\n",
" printmd(f\"- Unclassified / Failed: {unclassified_count}\")"
]
},
{
"cell_type": "markdown",
"id": "llama_temp_class_01",
"metadata": {},
"source": [
"## 4. Parameter Tuning: Temperature for Classification\n",
"\n",
"For classification, we generally want the most probable and consistent answer. \n",
"- **Low `temperature` (e.g., 0.0 - 0.2)** is highly recommended. This makes the output more deterministic.\n",
"- Higher temperatures might cause the LLM to be \"creative\" with category names or add extra text, which is undesirable for this task."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_temp_example_class_01",
"metadata": {},
"outputs": [],
"source": [
"ambiguous_ticket = \"The system feels slow sometimes when I search, and yesterday the search button looked weird after I clicked it too fast. Not sure if it's one thing or many.\"\n",
"\n",
"temperatures_to_test = [0.05, 0.5, 0.9]\n",
"\n",
"printmd(f\"**Classifying Ambiguous Ticket:**\\n```\\n{ambiguous_ticket}\\n```\")\n",
"\n",
"for temp in temperatures_to_test:\n",
" printmd(f\"--- Temperature = {temp} ---\")\n",
" # Re-using the classify_bug_ticket function, which sets its own temperature.\n",
" # To demonstrate temperature effect here, we'd ideally call call_llama_api directly or modify classify_bug_ticket\n",
" \n",
" # Direct call for demonstration:\n",
" category_list_str = \"\\n\".join([f\"{i+1}. {cat}\" for i, cat in enumerate(BUG_CATEGORIES)])\n",
" system_prompt_text = f\"You are an expert QA Bug Triage system. Classify the bug ticket into ONE of the following categories. Respond with ONLY the category name.\\nCategories:\\n{category_list_str}\"\n",
" user_prompt_text = f\"Bug Ticket Description:\\n```\\n{ambiguous_ticket}\\n```\\n---\\nCategory?\"\n",
"\n",
" prediction = call_llama_api(system_prompt_text, user_prompt_text, temperature=temp, max_tokens=20)\n",
" if prediction:\n",
" printmd(f\"**LLM Prediction (Temp {temp}):** `{prediction}`\")\n",
" # Basic validation\n",
" validated = any(cat.lower() == prediction.strip().lower() for cat in BUG_CATEGORIES)\n",
" printmd(f\" Valid Category Match: {validated}\")\n",
" else:\n",
" printmd(f\"**Failed to get prediction with Temperature = {temp} (Llama)**\")"
]
},
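{
"cell_type": "markdown",
"id": "llama_temp_consistency_md_01",
"metadata": {},
"source": [
"To make the effect easier to see, the sketch below re-runs the same ambiguous-ticket prompt a few times per temperature and counts how often each answer appears (it reuses the prompts from the cell above and assumes the server at `LLAMA_API_ENDPOINT` is still reachable; the number of repeats is an arbitrary choice)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llama_temp_consistency_code_01",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"N_REPEATS = 3  # Arbitrary small number to keep the demo fast\n",
"\n",
"# Reuses system_prompt_text / user_prompt_text (the ambiguous ticket) from the cell above.\n",
"for temp in temperatures_to_test:\n",
"    answers = []\n",
"    for _ in range(N_REPEATS):\n",
"        prediction = call_llama_api(system_prompt_text, user_prompt_text, temperature=temp, max_tokens=20)\n",
"        answers.append((prediction or \"<no response>\").strip())\n",
"    # With a low temperature the same answer should dominate; higher temperatures tend to vary more.\n",
"    printmd(f\"**Temp {temp}:** {dict(Counter(answers))}\")"
]
},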
{
"cell_type": "markdown",
"id": "llama_recap_homework_class_01",
"metadata": {},
"source": [
"## 5. Recap & Key Takeaways for Llama Classification\n",
"\n",
"- **Clear Task Definition**: Clearly define your categories and what constitutes each.\n",
"- **Explicit Prompting**: Instruct Llama to output *only* the category name from your list.\n",
"- **Few-Shot Learning**: Consider adding examples to your prompt for better accuracy, especially with Llama models.\n",
"- **Output Validation**: Always validate the LLM's output against your known categories.\n",
"- **Low Temperature**: Use a low temperature (0.0-0.2) for deterministic classification.\n",
"- **Iteration is Key**: Expect to iterate on your prompts and category definitions based on observed Llama performance.\n",
"\n",
"## Homework Suggestion (Llama Classification Focus):\n",
"\n",
"1. Choose a different text classification task relevant to your work (e.g., sentiment analysis of customer feedback, topic modeling of documents, intent recognition from user queries).\n",
"2. Define 3-5 clear categories for your chosen task.\n",
"3. Collect 5-10 example texts for classification.\n",
"4. Craft a Llama prompt (including system instructions and user input format) to classify these texts. Experiment with including 1-2 few-shot examples directly in your prompt.\n",
"5. Write Python code to call your Llama API, get the classification, and validate the output.\n",
"6. Test your prompt on your example texts. How accurate is it? What kind of errors does Llama make? How could you refine your prompt or category definitions to improve it?\n",
"\n",
"Please spend a few minutes providing feedback on this session [here](https://forms.gle/YOUR_FEEDBACK_LINK_HERE)!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}