Luke Hinds lukehinds

DeepFabric Dataset Tools

Utility scripts for analyzing, filtering, and cleaning synthetic datasets generated by DeepFabric.

Scripts

filter_tool_dataset.py

Generic quality filter for tool-calling datasets. Removes problematic patterns that can cause models to develop bad habits during training.
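As a rough illustration of this kind of filter, here is a minimal sketch. The description does not say which patterns the script targets, so the sketch assumes one plausible case: a single tool-call sequence dominating the dataset. The field names `tool_calls` and `name`, the helper names, and the `max_share` parameter are all hypothetical and are not taken from filter_tool_dataset.py.

```python
from collections import Counter


def tool_signature(sample):
    """Return the ordered tuple of tool names called in one sample.

    Assumes (hypothetically) that each sample is a dict with a
    "tool_calls" list whose items carry a "name" key.
    """
    return tuple(call["name"] for call in sample.get("tool_calls", []))


def filter_overrepresented(samples, max_share=0.5):
    """Cap how many samples may share the same tool-call signature.

    Any signature is kept at most int(len(samples) * max_share) times
    (never fewer than once); later repeats are dropped.
    """
    limit = max(1, int(len(samples) * max_share))
    seen = Counter()
    kept = []
    for sample in samples:
        sig = tool_signature(sample)
        seen[sig] += 1
        if seen[sig] <= limit:
            kept.append(sample)
    return kept
```

Capping by share rather than by an absolute count keeps the sketch independent of dataset size; a real filter would likely combine several such checks.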

#!/usr/bin/env python3
"""
Generic Dataset Quality Filter for Tool-Calling Datasets
This script filters out problematic patterns from ANY synthetic tool-calling dataset
that can cause models to develop bad habits during training.
Key features:
1. Auto-detection mode: Discovers problematic patterns from the data itself
2. Schema-agnostic: Works with any tool-calling dataset (Blender, Kubernetes, GitHub, etc.)
#!/usr/bin/env python3
"""
Script to detect and optionally remove duplicate topics in JSON graph files.
Uses SHA256 checksums (already computed in node metadata) and other matching
strategies to identify duplicate topics.
Example usage:
# Report duplicates using exact hash matching
python tools/dedupe_graph.py --input examples/basic-graph-topics.jsonl
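
The exact-hash strategy mentioned in the dedupe script's docstring can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the node fields `id`, `topic`, and `metadata["topic_hash"]` are hypothetical names, and the fallback that hashes the raw topic text is an assumption about how the stored checksums were computed.

```python
import hashlib


def topic_hash(topic):
    """SHA256 hex digest of the exact topic text (no normalization)."""
    return hashlib.sha256(topic.encode("utf-8")).hexdigest()


def find_duplicates(nodes):
    """Group node ids by topic hash and return only groups with repeats.

    Uses a precomputed hash from node metadata when present (hypothetical
    key "topic_hash"); otherwise hashes the topic text directly. Mixing
    the two only works if both were computed the same way.
    """
    groups = {}
    for node in nodes:
        digest = node.get("metadata", {}).get("topic_hash") or topic_hash(node["topic"])
        groups.setdefault(digest, []).append(node["id"])
    return {digest: ids for digest, ids in groups.items() if len(ids) > 1}
```

Reporting the groups first, and removing nodes only in a separate step, mirrors the "detect and optionally remove" behavior the docstring describes.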
#####################################################################
# Spin Blender Tools Dataset Configuration
#####################################################################
# This configuration demonstrates using Blender MCP tools via Spin
# for generating synthetic 3D design assistant training data.
#
# Prerequisites:
#   1. Start the Spin service:
#      cd tools-sdk
#      spin build && spin up
#!/bin/bash
# Load comprehensive Blender MCP mock data into the mock tools server
#
# Usage: ./load-blender-mock-data.sh [base_url]
# Default base_url: http://localhost:3000
set -e
BASE_URL="${1:-http://localhost:3000}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
{
  "description": "Comprehensive Blender MCP mock data for testing tool execution with 3D design assistant scenarios",
  "version": "1.0.0",
  "mockResponses": {
    "get_scene_info": {
      "defaultResponse": {
        "name": "Untitled",
        "objects": [],
        "activeObject": null,
        "renderEngine": "CYCLES",
{
  "description": "Comprehensive Google Workspace mock data for testing tool execution with productivity assistant scenarios",
  "version": "1.0.0",
  "mockResponses": {
    "search_gmail_messages": {
      "defaultResponse": {
        "messages": [
          {
            "id": "msg_001",
            "threadId": "thread_001",
#####################################################################
# Spin Google Workspace Tools Dataset Configuration
#####################################################################
# This configuration demonstrates using Google Workspace MCP tools via Spin
# for generating synthetic productivity assistant training data.
#
# Prerequisites:
#   1. Start the Spin service:
#      cd tools-sdk
#      spin build && spin up
#!/bin/bash
# Load comprehensive Google Workspace mock data into the mock tools server
#
# Usage: ./load-google-workspace-mock-data.sh [base_url]
# Default base_url: http://localhost:3000
set -e
BASE_URL="${1:-http://localhost:3000}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"