Utility scripts for analyzing, filtering, and cleaning synthetic datasets generated by DeepFabric.
Generic quality filter for tool-calling datasets. Removes problematic patterns that can cause models to develop bad habits during training.
```python
#!/usr/bin/env python3
"""
Generic Dataset Quality Filter for Tool-Calling Datasets

This script filters out problematic patterns from ANY synthetic tool-calling
dataset that can cause models to develop bad habits during training.

Key features:
1. Auto-detection mode: Discovers problematic patterns from the data itself
2. Schema-agnostic: Works with any tool-calling dataset (Blender, Kubernetes, GitHub, etc.)
"""
```
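The core idea of such a filter can be sketched as follows. The pattern list and message schema below are illustrative assumptions, not the script's actual rules:

```python
import re

# Hypothetical "bad habit" patterns -- real deployments would curate these
# from the data itself (the script's auto-detection mode).
BAD_PATTERNS = [
    re.compile(r"I cannot call tools"),  # refusal despite tools being available
    re.compile(r"As an AI language model"),  # boilerplate disclaimer leakage
]

def is_clean(sample: dict) -> bool:
    """Return True if no assistant message matches a problematic pattern."""
    for msg in sample.get("messages", []):
        if msg.get("role") != "assistant":
            continue
        text = msg.get("content") or ""
        if any(p.search(text) for p in BAD_PATTERNS):
            return False
    return True

# Tiny inline dataset to demonstrate the filter; schema is assumed.
dataset = [
    {"messages": [{"role": "assistant", "content": "Calling get_scene_info now."}]},
    {"messages": [{"role": "assistant", "content": "I cannot call tools directly."}]},
]
clean = [s for s in dataset if is_clean(s)]
```

Because the check only touches the `messages` list, the same pass works regardless of which tool schema the dataset uses.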
```python
#!/usr/bin/env python3
"""
Script to detect and optionally remove duplicate topics in JSON graph files.

Uses SHA256 checksums (already computed in node metadata) and other matching
strategies to identify duplicate topics.

Example usage:
    # Report duplicates using exact hash matching
    python tools/dedupe_graph.py --input examples/basic-graph-topics.jsonl
"""
```
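A minimal sketch of checksum-based duplicate detection, assuming nodes carry a `metadata.checksum` field as described; the fallback normalization is an illustrative guess, not the script's actual matching strategy:

```python
import hashlib
from collections import defaultdict

def topic_checksum(topic: str) -> str:
    """SHA256 of the normalized topic text (assumed normalization)."""
    return hashlib.sha256(topic.strip().lower().encode()).hexdigest()

def find_duplicates(nodes: list[dict]) -> list[list[int]]:
    """Group node ids by checksum; any group with more than one id is a duplicate set."""
    groups = defaultdict(list)
    for node in nodes:
        # Prefer the checksum already stored in node metadata, if present.
        checksum = node.get("metadata", {}).get("checksum") or topic_checksum(node["topic"])
        groups[checksum].append(node["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Example graph nodes (hypothetical shape).
nodes = [
    {"id": 1, "topic": "Mesh modeling"},
    {"id": 2, "topic": "mesh modeling "},  # differs only in case/whitespace
    {"id": 3, "topic": "Lighting"},
]
dupes = find_duplicates(nodes)
```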
| ##################################################################### | |
| # Spin Blender Tools Dataset Configuration | |
| ##################################################################### | |
| # This configuration demonstrates using Blender MCP tools via Spin | |
| # for generating synthetic 3D design assistant training data. | |
| # | |
| # Prerequisites: | |
| # 1. Start the Spin service: | |
| # cd tools-sdk | |
| # spin build && spin up |
```bash
#!/bin/bash
# Load comprehensive Blender MCP mock data into the mock tools server
#
# Usage: ./load-blender-mock-data.sh [base_url]
# Default base_url: http://localhost:3000

set -e

BASE_URL="${1:-http://localhost:3000}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
```
| { | |
| "description": "Comprehensive Blender MCP mock data for testing tool execution with 3D design assistant scenarios", | |
| "version": "1.0.0", | |
| "mockResponses": { | |
| "get_scene_info": { | |
| "defaultResponse": { | |
| "name": "Untitled", | |
| "objects": [], | |
| "activeObject": null, | |
| "renderEngine": "CYCLES", |
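As a quick sanity check, a mock-data file of this shape can be validated before loading. The structural keys below (`mockResponses`, `defaultResponse`) come from the excerpt; everything else is illustrative:

```python
import json

# A trimmed-down mock-data document in the shape shown above.
mock_data = json.loads("""
{
  "description": "Blender MCP mock data",
  "version": "1.0.0",
  "mockResponses": {
    "get_scene_info": {
      "defaultResponse": {"name": "Untitled", "objects": [], "renderEngine": "CYCLES"}
    }
  }
}
""")

def usable_tools(doc: dict) -> list[str]:
    """Return the tool names that carry a dict-valued defaultResponse."""
    tools = []
    for name, spec in doc.get("mockResponses", {}).items():
        if isinstance(spec.get("defaultResponse"), dict):
            tools.append(name)
    return tools

valid_tools = usable_tools(mock_data)
```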
| { | |
| "description": "Comprehensive Google Workspace mock data for testing tool execution with productivity assistant scenarios", | |
| "version": "1.0.0", | |
| "mockResponses": { | |
| "search_gmail_messages": { | |
| "defaultResponse": { | |
| "messages": [ | |
| { | |
| "id": "msg_001", | |
| "threadId": "thread_001", |
| ##################################################################### | |
| # Spin Google Workspace Tools Dataset Configuration | |
| ##################################################################### | |
| # This configuration demonstrates using Google Workspace MCP tools via Spin | |
| # for generating synthetic productivity assistant training data. | |
| # | |
| # Prerequisites: | |
| # 1. Start the Spin service: | |
| # cd tools-sdk | |
| # spin build && spin up |
```bash
#!/bin/bash
# Load comprehensive Google Workspace mock data into the mock tools server
#
# Usage: ./load-google-workspace-mock-data.sh [base_url]
# Default base_url: http://localhost:3000

set -e

BASE_URL="${1:-http://localhost:3000}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
```
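Both loader scripts essentially POST the mock-data JSON to the mock tools server at `base_url`. A hedged Python sketch of building that request follows; the `/mock-data` endpoint path is an assumption for illustration, not a documented API of the server:

```python
import json
import urllib.request

def build_load_request(mock_data: dict, base_url: str = "http://localhost:3000"):
    """Build the POST request a loader would send.

    NOTE: the /mock-data path is hypothetical; check the mock tools
    server for its actual upload endpoint.
    """
    return urllib.request.Request(
        f"{base_url}/mock-data",
        data=json.dumps(mock_data).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_load_request({"version": "1.0.0", "mockResponses": {}})
```

Sending the request is then a matter of `urllib.request.urlopen(req)` once the Spin service is running.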