A containerized AI Toolkit setup with MCP (Model Context Protocol) integration for training LoRA models and fine-tuning diffusion models. This provides a complete solution for training custom LoRA models with full MCP integration, allowing AI assistants to manage the entire training workflow.
See the template repository for a complete example. Also includes the ComfyUI MCP Server used for creating images/videos from the trained models.
- Fully Containerized: AI Toolkit and MCP server run in Docker containers
- NVIDIA GPU Support: Full CUDA 12.1 support for GPU-accelerated training
- Persistent Storage: All datasets, configs, and outputs are persisted via volume mounts
- MCP Integration: AI assistants can create training configs and manage training jobs via MCP tools
- HTTP MCP API: HTTP API for easy integration with remote agents
- Dataset Management: Upload and manage training datasets via MCP
- Smart Configuration: Automatic test prompt generation with trigger word integration
- Training Monitoring: Real-time training status and progress tracking
- Model Download: Download trained models directly through MCP
- Web UI: Access AI Toolkit's web interface at http://localhost:8675
- Docker and Docker Compose
- NVIDIA GPU with Docker GPU support (`nvidia-docker2`)
- CUDA-compatible GPU (CUDA 12.1)
- At least 30GB free disk space for models and datasets
- 24GB+ VRAM recommended for FLUX LoRA training (lower VRAM possible with `low_vram` mode)
- Build the Docker images:

  docker-compose build

- Start the services:

  # Start AI Toolkit and standard MCP server
  docker-compose up -d

  # Optional: Start HTTP API server for easier integration
  docker-compose up -d mcp-http-server

- Access the services:
  - AI Toolkit Web UI: http://localhost:8675
  - MCP HTTP API: http://localhost:8190 (if started)
  - MCP stdio: Available in the `mcp-ai-toolkit-trainer` container
model-trainer-mcp/
├── docker-compose.yml # Container orchestration
├── Dockerfile # Container image definition
├── mcp_server.py # MCP server implementation
├── mcp_http_server.py # HTTP API wrapper
├── requirements.txt # Python dependencies
└── example_training.py # Example usage script
Note: All files are at the root level for GitHub Gist compatibility. When deployed:
- Datasets are stored in Docker volumes at `/ai-toolkit/datasets/`
- Configs are saved to `/ai-toolkit/configs/`
- Outputs are written to `/ai-toolkit/outputs/`
- Logs are stored in `/ai-toolkit/logs/`
Create a new LoRA training configuration with customizable parameters.
Parameters:
- `name` (required): Name for the training job
- `model_name` (required): Base model name (e.g., "ostris/Flex.1-alpha", "runwayml/stable-diffusion-v1-5")
  - Note: Use publicly accessible models. `black-forest-labs/FLUX.1-dev` requires authentication
- `dataset_path` (required): Path to the dataset folder (e.g., "/ai-toolkit/datasets/my-dataset")
- `resolution`: Training resolution in pixels (default: 512)
- `batch_size`: Training batch size (default: 1)
- `learning_rate`: Learning rate (default: 0.0002)
- `steps`: Number of training steps (default: 1000)
- `rank`: LoRA rank - higher for more complex concepts (default: 16)
- `alpha`: LoRA alpha - typically same as rank (default: 16)
- `use_wandb`: Enable Weights & Biases logging (default: false)
- `low_vram`: Enable low VRAM mode for GPUs with <24GB (default: true)
- `trigger_word`: Unique trigger word for activating the LoRA
- `test_prompts`: Array of test prompts for validation (recommended: 4 prompts)
  - Include 3 similar prompts and 1 unique/creative prompt
  - All prompts must include the trigger word
  - If not provided, default prompts will be auto-generated
- `disable_sampling`: Disable sample image generation during training (default: false)
  - Useful for faster training when you don't need intermediate samples
  - Significantly reduces training time by skipping image generation
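As a quick sketch of how the defaults play out, the call below (sent via the optional HTTP wrapper described later in this README; the names are placeholders) passes only the required fields plus a trigger word and lets everything else fall back to the defaults above, including auto-generated test prompts:

```python
import requests

MCP_URL = "http://localhost:8190/mcp/tool"  # optional HTTP wrapper, if started

resp = requests.post(MCP_URL, json={
    "tool": "create-training-config",
    "arguments": {
        # Required fields only; resolution/steps/rank etc. use their defaults,
        # and test prompts are auto-generated around the trigger word.
        "name": "my-default-lora",
        "model_name": "ostris/Flex.1-alpha",
        "dataset_path": "/ai-toolkit/datasets/my-dataset",
        "trigger_word": "my_style",
    },
}, timeout=60)
resp.raise_for_status()
print(resp.json())
```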
List all available training configurations.
Retrieve a specific training configuration by name.
Get training information including trigger word and test prompts for a configuration.
Parameters:
- `name` (required): Configuration name
Upload images with captions to create a new training dataset.
Parameters:
- `dataset_name` (required): Name for the dataset
- `images` (required): Array of images with:
  - `filename`: Image filename
  - `content`: Base64-encoded image content
  - `caption`: Caption/description for the image
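A short Python sketch of assembling this payload from a local folder of images and captions and sending it through the optional HTTP wrapper (the `requests` usage and helper name are illustrative; the `POST /mcp/tool` endpoint and argument shape follow the HTTP API examples later in this README):

```python
import base64
from pathlib import Path

import requests

MCP_URL = "http://localhost:8190/mcp/tool"  # optional HTTP wrapper, if started

def upload_dataset(dataset_name: str, image_dir: str) -> dict:
    """Base64-encode every image in image_dir and pair it with its sibling .txt caption."""
    images = []
    for img in sorted(Path(image_dir).iterdir()):
        if img.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        caption_file = img.with_suffix(".txt")
        images.append({
            "filename": img.name,
            "content": base64.b64encode(img.read_bytes()).decode("ascii"),
            "caption": caption_file.read_text().strip() if caption_file.exists() else "",
        })
    payload = {"tool": "upload-dataset",
               "arguments": {"dataset_name": dataset_name, "images": images}}
    resp = requests.post(MCP_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()

# Example: upload_dataset("my-style-dataset", "./my_images")
```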
List available datasets with image counts.
Start a training job using a saved configuration.
Parameters:
- `config_name` (required): Name of the configuration to use
Get the current status of a training job.
Parameters:
- `job_id` (required): Training job ID
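For the monitoring workflow, a simple polling loop against this tool via the optional HTTP wrapper could look like the sketch below; the exact status fields and the "running" marker are assumptions, so adjust them to the real `get-training-status` response:

```python
import time

import requests

MCP_URL = "http://localhost:8190/mcp/tool"  # optional HTTP wrapper, if started

def wait_for_training(job_id: str, poll_seconds: int = 60):
    """Poll get-training-status until the job no longer reports as running."""
    while True:
        resp = requests.post(MCP_URL, json={
            "tool": "get-training-status",
            "arguments": {"job_id": job_id},
        }, timeout=30)
        resp.raise_for_status()
        status = resp.json()
        print(status)  # exact response shape depends on the MCP server
        # Treating a "running" marker as in-progress is an assumption; adjust as needed
        if "running" not in str(status).lower():
            return status
        time.sleep(poll_seconds)
```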
Stop a running training job.
Parameters:
- `job_id` (required): Training job ID to stop
List all training jobs and their statuses.
Export a trained model in the specified format.
Parameters:
- `job_id` (required): Training job ID
- `format`: Export format ("safetensors" or "ckpt", default: "safetensors")
List all trained models available in the outputs directory.
Download a trained LoRA model as base64-encoded content.
Parameters:
- `model_path` (required): Path to the model file (relative to outputs directory)
- `include_metadata`: Include training metadata if available (default: true)
Returns:
- Base64-encoded model content
- Model metadata (if available)
- Training configuration (if available)
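Mirroring the `jq` one-liner shown later in the download example, a Python client might decode the returned content roughly like this (a sketch; it assumes the same `result`/`content` layout that the `jq` command reads):

```python
import base64
import json

import requests

MCP_URL = "http://localhost:8190/mcp/tool"  # optional HTTP wrapper, if started

resp = requests.post(MCP_URL, json={
    "tool": "download-model",
    "arguments": {"model_path": "my-flux-lora/checkpoint-1000.safetensors"},
}, timeout=600)
resp.raise_for_status()

# .result holds a JSON string whose "content" field is the base64-encoded model
result = json.loads(resp.json()["result"])
with open("my_lora.safetensors", "wb") as f:
    f.write(base64.b64decode(result["content"]))
```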
Get AI Toolkit system statistics including GPU usage.
Retrieve logs for a specific training job.
Parameters:
- `job_id` (required): Training job ID
- `lines`: Number of log lines to retrieve (default: 100)
The MCP server can optionally be accessed via HTTP for easier integration.
docker-compose up -d mcp-http-server

- `GET /` - API documentation
- `GET /health` - Health check
- `POST /mcp/tool` - Execute any MCP tool
- `GET /mcp/tools` - List available tools
- `GET /datasets` - List datasets
- `GET /configs` - List training configurations
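The GET endpoints are handy for a quick sanity check before invoking any tools; a minimal Python sketch (assuming the wrapper is reachable on its default port 8190):

```python
import requests

BASE = "http://localhost:8190"  # MCP HTTP wrapper

# Confirm the wrapper is reachable, then list the MCP tools it exposes
health = requests.get(f"{BASE}/health", timeout=10)
print(health.status_code, health.text)

tools = requests.get(f"{BASE}/mcp/tools", timeout=10)
print(tools.text)
```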
curl -X POST http://localhost:8190/mcp/tool \
-H "Content-Type: application/json" \
-d '{
"tool": "create-training-config",
"arguments": {
"name": "my-flux-lora",
"model_name": "ostris/Flex.1-alpha",
"dataset_path": "/ai-toolkit/datasets/my-dataset",
"steps": 2000,
"rank": 32,
"trigger_word": "my_style",
"test_prompts": [
"a photo of my_style artwork",
"a detailed image of my_style",
"a high quality picture of my_style",
"my_style in a cyberpunk cityscape with neon lights"
]
}
}'

curl -X POST http://localhost:8190/mcp/tool \
-H "Content-Type: application/json" \
-d '{
"tool": "get-training-info",
"arguments": {
"name": "my-flux-lora"
}
}'

The upload-dataset tool automatically generates .aitk_size.json metadata files required by AI Toolkit for proper dataset recognition.
curl -X POST http://localhost:8190/mcp/tool \
-H "Content-Type: application/json" \
-d '{
"tool": "upload-dataset",
"arguments": {
"dataset_name": "my-style-dataset",
"images": [
{
"filename": "image1.jpg",
"content": "base64_encoded_content_here",
"caption": "a photo of my_style artwork"
}
]
}
}'

curl -X POST http://localhost:8190/mcp/tool \
-H "Content-Type: application/json" \
-d '{
"tool": "start-training",
"arguments": {
"config_name": "my-flux-lora"
}
}'

# List available models first
curl -X POST http://localhost:8190/mcp/tool \
-H "Content-Type: application/json" \
-d '{
"tool": "list-exported-models",
"arguments": {}
}'
# Download a specific model
curl -X POST http://localhost:8190/mcp/tool \
-H "Content-Type: application/json" \
-d '{
"tool": "download-model",
"arguments": {
"model_path": "my-flux-lora/checkpoint-1000.safetensors"
}
}' > model_response.json
# Extract the base64 content and decode it
jq -r '.result | fromjson | .content' model_response.json | base64 -d > my_lora.safetensors

For best results with LoRA training:
- Image Requirements:
- PNG or JPG format
- Consistent aspect ratio (square images work best for FLUX)
- High quality, diverse examples showing different angles/contexts
- 10-50 images typically sufficient for style LoRAs
- 20-100 images for character/object LoRAs
- Caption Format:
  - Each image needs a corresponding .txt file with the same name (a small caption-file helper is sketched after this list)
  - Include trigger words consistently in captions
  - Be descriptive but concise
  - Vary the descriptions while keeping the trigger word
  - Example: "a photo of my_style artwork, digital painting, vibrant colors"
- Dataset Structure:

  datasets/my-dataset/
  ├── image1.jpg
  ├── image1.txt
  ├── image2.jpg
  ├── image2.txt
  └── ...

- Caption Best Practices:
- Place trigger word at the beginning: "my_style portrait of a woman"
- Add variety: backgrounds, lighting, poses, contexts
- Include style descriptors: "my_style, oil painting style, dramatic lighting"
- Avoid repetitive captions - each should be unique
- Flux Model Caption Guidelines:
For Flux models specifically, ensure your dataset image captions cover these essential elements:
- Trigger Word: Always include your unique trigger word (e.g., "my_style", "xyz_character")
- Subject: Clearly describe what's in the photo (e.g., "a woman", "a landscape", "a robot")
- Angle/Perspective: Specify the camera angle or viewpoint (e.g., "front view", "aerial shot", "close-up portrait", "three-quarter view")
- Environment/Setting: Describe where the subject is located (e.g., "in a forest", "urban street", "studio background", "underwater")
- Lighting: Include lighting conditions (e.g., "soft natural light", "dramatic rim lighting", "golden hour", "neon illumination")
Example Caption Structure:
"my_style portrait of a woman, three-quarter view, in a modern office, soft window lighting" "my_style robot, full body shot from below, in a cyberpunk cityscape, neon purple lighting" "my_style landscape, wide aerial view, mountain forest environment, sunset golden hour lighting"This comprehensive captioning helps Flux models better understand and reproduce your style across different contexts and conditions.
See example_training.py for a complete example of using the MCP HTTP API to:
- Create training configurations
- Upload datasets
- Start and monitor training
- Download trained models
Run the example:
python example_training.py

AI Toolkit supports training LoRAs for:
- FLUX/Flex Models: Latest state-of-the-art models
  - `ostris/Flex.1-alpha` (recommended - publicly accessible)
  - `ostris/Flux.1-dev` (if available)
  - Note: `black-forest-labs/FLUX.1-dev` requires authentication
- Stable Diffusion Models:
  - `runwayml/stable-diffusion-v1-5` (widely compatible)
  - `stabilityai/stable-diffusion-2-1`
  - `CompVis/stable-diffusion-v1-4`
- SDXL Models and other diffusion models supported by AI Toolkit
- `LOG_LEVEL`: Set logging level (default: INFO)
- `AI_TOOLKIT_SERVER_URL`: Override AI Toolkit server URL
- `MCP_HTTP_PORT`: HTTP API port (default: 8190)
The setup uses all available NVIDIA GPUs by default. To limit GPU usage, modify NVIDIA_VISIBLE_DEVICES in docker-compose.yml.
- Low VRAM Mode: Enable `low_vram: true` in configurations for GPUs with <24GB VRAM
- Learning Rate:
- Default is now 2e-4 (0.0002) for better training performance
- Use 1e-4 (0.0001) for more conservative training
- Use 5e-5 (0.00005) for fine-tuning existing styles
- Increase to 3e-4 (0.0003) for stubborn concepts
- Steps:
- Formula: 100 × number of images in dataset
- Example: 20 images = 2000 steps, 30 images = 3000 steps
- For single image: 100 steps is often sufficient
- Monitor test images - stop early if overfitting occurs
- Higher step counts may lead to overfitting on small datasets
- Rank Selection:
- 8-16: Simple styles or minor adjustments
- 16-32: Standard character/style LoRAs
- 32-64: Complex concepts or multiple subjects
- Test Prompts:
- Always include 4 test prompts
- 3 should be variations of your training data
- 1 should test generalization (unique scenario)
- Trigger Words:
- Use unique, non-dictionary words
- Avoid common words that might conflict
- Examples: "xyz_style", "abc_character", "def_object"
- Sample Generation:
- Enable by default to monitor training progress
- Disable with `disable_sampling: true` for faster training
- Disabling saves ~20-30% training time on FLUX models
# Build all services
docker-compose build

The build script automatically detects Linux systems and uses host network mode for better DNS resolution during the build process, which helps avoid network-related build failures.
- Check NVIDIA Docker runtime: `docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi`
- Ensure ports 8675/8190 are not already in use
- Verify Docker has GPU access: `docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi`
- Check for container name conflicts with other MCP services (all container names should be unique)
- Check dataset path exists and contains images with captions
- Verify GPU has sufficient VRAM (use `nvidia-smi` to check)
- Ensure all images have corresponding .txt caption files
- Check logs: `docker-compose logs mcp-ai-toolkit-trainer`
- If you see "Cannot read properties of undefined (reading 'process')", the job config format is incorrect
- If you see model not found errors, ensure you're using accessible models like `ostris/Flex.1-alpha`
- Dataset must contain a .aitk_size.json metadata file (automatically generated by the upload-dataset tool)
- The metadata file contains image dimensions and file signatures in AI Toolkit's specific format
- For manual uploads, you'll need to generate this file with the correct format:
{ "__version__": "0.1.2", "\\image1.jpg": [width, height, "filesize:hash"] }
- Check logs: `docker-compose logs mcp-ai-toolkit-trainer`
- Ensure AI Toolkit is healthy: `docker-compose ps`
- Verify MCP server is running: `docker exec mcp-ai-toolkit-trainer ps aux | grep mcp`
- Ensure mcp-http-server is running: `docker-compose ps mcp-http-server`
- Check logs: `docker-compose logs mcp-http-server`
- Verify port 8190 is accessible: `curl http://localhost:8190/health`
- Enable `low_vram: true` in your configuration
- Reduce batch size to 1
- Lower resolution to 512x512
- Reduce rank to 8 or 16
- GPU Memory: FLUX LoRA training requires ~20-24GB VRAM
- Training Time:
- SD 1.5: ~1.5-2 iter/sec on RTX 4090 (100 steps ≈ 1-2 minutes)
- FLUX/Flex: ~0.3-0.5 iter/sec on RTX 4090 (100 steps ≈ 5-10 minutes)
- 1000 steps: 30-60 minutes on RTX 4090
- 3000 steps: 90-180 minutes on RTX 4090
- Disk Space: Each training run can use 5-10GB for checkpoints
If deploying on a remote server:
- Update MCP configuration: When containers run on a remote server, update your MCP client configuration to point to the correct host:

  {
    "mcpServers": {
      "ai-toolkit": {
        "command": "docker",
        "args": ["exec", "-i", "mcp-ai-toolkit-trainer", "python", "/app/mcp_server.py"],
        "env": {
          "AI_TOOLKIT_SERVER_URL": "http://YOUR_REMOTE_HOST:8675"
        }
      }
    }
  }

- Access services remotely:
  - AI Toolkit Web UI: http://YOUR_REMOTE_HOST:8675
  - MCP HTTP API: http://YOUR_REMOTE_HOST:8190
- Monitor training: Use the Web UI or MCP tools to monitor training progress remotely
# Stop all services
docker-compose down
# Stop and remove volumes (WARNING: deletes all datasets and outputs)
docker-compose down -v

- Config Storage: Configurations created via MCP are stored in the database and are NOT visible as files in the Web UI's config browser
- Model Access: Some models like `black-forest-labs/FLUX.1-dev` require authentication; use publicly accessible alternatives like `ostris/Flex.1-alpha`
- Dataset Path: When using `create-training-config`, the dataset_path should be just the dataset name (e.g., "my-dataset"), not the full path
All MCP tools are fully functional with the AI Toolkit Web UI. The MCP server integrates with AI Toolkit's database and API endpoints.
Fully functional tools:
- ✅ Configuration management (create, list, get)
- ✅ Dataset upload and management (with automatic .aitk_size.json generation)
- ✅ Training job control (start, stop, status)
- ✅ Model listing and downloading
- ✅ Real-time training monitoring
- ✅ Training logs retrieval
Important Notes:
- Configurations created via MCP are stored in AI Toolkit's SQLite database
- These configs are NOT visible as files in the Web UI's config browser
- The Web UI and filesystem configs are separate systems
- Jobs created via MCP's `start-training` tool ARE visible in the Web UI
- The Web UI expects configs in this exact format:

  {
    "job": "extension",
    "config": {
      "name": "job_name",
      "process": [{
        "type": "ui_trainer",
        // ... rest of config
      }]
    },
    "meta": { "name": "[name]", "version": "1.0" }
  }

- Database initialization is handled automatically during container build
- AI Toolkit uses Prisma ORM with a SQLite database (`aitk_db.db`)
- Jobs are stored with `job_config` as a stringified JSON object
- The database is initialized during container build with `npm run update_db`
The MCP server generates configurations that match the Web UI's expected format:
- Job type must be `"extension"` (not `"train"` or other values)
- Trainer type must be `"ui_trainer"` (not `"sd_trainer"`)
- Config is wrapped in a `{job, config, meta}` structure
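Putting the notes above together, a sketch of building that wrapper in Python and stringifying it the way `job_config` is stored (the inner trainer options are placeholders, not the full config the MCP server actually writes):

```python
import json

def build_job_config(name: str, trainer_options: dict) -> str:
    """Wrap a trainer config in the {job, config, meta} structure the Web UI expects
    and return it as a stringified JSON object, as stored in job_config."""
    wrapper = {
        "job": "extension",              # must be "extension", not "train"
        "config": {
            "name": name,
            "process": [{
                "type": "ui_trainer",    # must be "ui_trainer", not "sd_trainer"
                **trainer_options,       # placeholder for the rest of the config
            }],
        },
        "meta": {"name": "[name]", "version": "1.0"},
    }
    return json.dumps(wrapper)

# Example: build_job_config("my-flux-lora", {"trigger_word": "my_style"})
```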
The .aitk_size.json file uses a specific format:
- File paths use a backslash prefix: `"\\image.jpg"`
- Signature format: `"filesize:hash"`, where hash is the first 8 chars of the MD5 as decimal
- Only the first 1024 bytes are used for hash calculation
- AI Toolkit runs on port 8675 (Web UI)
- MCP server runs inside the ai-toolkit container
- Optional HTTP wrapper runs on port 8190
- All containers share volumes for datasets, configs, and outputs
