Prince Canuma (Blaizzy)
"""
Benchmark TriAttention on MATH 500 — matching the paper's evaluation protocol.
Paper settings: max_tokens=32768, temp=0.6, top_p=0.95, budget=512/1024/2048
We use max_tokens=4096 for practical runtime on Apple Silicon.
USAGE
python bench_triattention_math.py \
--model /tmp/gemma-4-26b-a4b-it-5bit \
--calib /tmp/gemma4_26b_5bit_calib.safetensors \
"""
Benchmark TurboQuant (TBQ) vs baseline on MM-NIAH (Multimodal Needle-in-a-Haystack).
INSTALL
pip install -U mlx-vlm
# or
uv pip install -U mlx-vlm
SETUP — Extract images (one-time)
huggingface-cli download OpenGVLab/MM-NIAH mm_niah_val/images.tar.gz --repo-type dataset
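After the download, the archive still needs to be unpacked. A stdlib sketch of that one-time step (the destination path and helper name are our assumptions):

```python
import tarfile
from pathlib import Path

def extract_archive(archive: str, dest: str) -> list[str]:
    """Extract a .tar.gz into dest and return the extracted member names."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tf:
        members = tf.getnames()
        # only extract archives from sources you trust (path traversal risk)
        tf.extractall(dest)
    return members
```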
"""Benchmark TurboQuant vs baseline on LongBench-v2.
Usage:
python scripts/bench_longbench_v2.py --model google/gemma-4-e4b-it --num-samples 10 --max-tokens-ctx 260000
python scripts/bench_longbench_v2.py --model google/gemma-4-26b-a4b-it --num-samples 5 --max-tokens-ctx 128000 --kv-bits 4
"""
import argparse
import importlib
import time
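The flags in the usage lines imply a parser roughly like the following (a sketch; the defaults and help strings are assumptions, not copied from the script):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        description="Benchmark KV-cache quantization on LongBench-v2")
    p.add_argument("--model", required=True, help="HF repo id or local path")
    p.add_argument("--num-samples", type=int, default=10,
                   help="number of LongBench-v2 samples to evaluate")
    p.add_argument("--max-tokens-ctx", type=int, default=128000,
                   help="cap on context length in tokens")
    p.add_argument("--kv-bits", type=int, default=None,
                   help="KV-cache quantization bits (omit for baseline)")
    return p
```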
@Blaizzy
Blaizzy / qwen3_tts_benchmark.py
Last active March 2, 2026 21:23
Qwen3-TTS Benchmark: TTFB, inter-chunk latency, throughput, and batch generation metrics for mlx-audio
#!/usr/bin/env python3
"""
Benchmark for Qwen3-TTS: measures TTFB, inter-chunk latency, and throughput.
Usage:
# Sequential only (short/medium/long)
python qwen3_tts_benchmark.py --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16
# Sequential + batched (1,2,3,4,8)
python qwen3_tts_benchmark.py --batch-size 1 2 3 4 8
"""
#!/usr/bin/env python3
"""
Benchmark for Qwen3-TTS: measures TTFB, inter-chunk latency, and throughput.
Usage:
python benchmarks/qwen3_tts_benchmark.py
python benchmarks/qwen3_tts_benchmark.py --model mlx-community/Qwen3-TTS-0.6B-bf16
python benchmarks/qwen3_tts_benchmark.py --num-trials 3 --streaming-interval 1.0
python benchmarks/qwen3_tts_benchmark.py --prompts short medium long
"""
@Blaizzy
Blaizzy / BENCHMARK_RESULTS.md
Last active February 28, 2026 14:00
Qwen3-TTS Batch Generation Benchmark Results (4-bit, 6-bit, 8-bit, bf16)

Qwen3-TTS Batch Generation Benchmark Results

Voice: serena · Device: Apple Silicon (MLX) · Date: 2025-02-28


Cross-Model Comparison (batch=4, short prompt)

@Blaizzy
Blaizzy / BaseConfiguration.swift
Created February 23, 2026 21:36
Get started with MLX-Swift using a Qwen3 port
//
// BaseConfiguration.swift
// mlx-test
//
// Created by Prince Canuma on 29/12/25.
//
import Foundation
import MLX
mlx_audio.tts.generate \
--model mlx-community/chatterbox-turbo-fp16 \
--text 'Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
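Passing an entire paper abstract as `--text` works, but long inputs are often split into sentence-sized chunks before synthesis to keep latency per utterance low. A naive sketch of such a splitter (ours, not part of mlx-audio):

```python
import re

def split_sentences(text: str, max_chars: int = 300) -> list[str]:
    """Split on sentence boundaries, then greedily pack sentences
    into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + 1 + len(s) > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks
```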
@Blaizzy
Blaizzy / convert_weights.py
Created December 18, 2025 10:49
Chatterbox Turbo MLX port
#!/usr/bin/env python3
# Copyright (c) 2025 Resemble AI
# MIT License
# Weight conversion script: PyTorch -> MLX
"""
Converts Chatterbox Turbo weights from PyTorch to MLX format.
Usage:
python convert_weights.py --output model.safetensors
"""
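The core of a PyTorch-to-MLX conversion is usually an ordered key-renaming pass over the state dict. A framework-agnostic sketch; the rename rules below are illustrative, not Chatterbox Turbo's actual mapping:

```python
# illustrative rules: (substring to find, replacement), applied in order
RENAME_RULES = [
    ("model.", ""),          # drop a wrapper prefix
    (".gamma", ".weight"),   # legacy LayerNorm parameter naming
    (".beta", ".bias"),
]

def convert_keys(state_dict: dict) -> dict:
    """Apply the ordered substring rename rules to every weight key."""
    out = {}
    for key, value in state_dict.items():
        for old, new in RENAME_RULES:
            key = key.replace(old, new)
        out[key] = value
    return out
```

Real converters also transpose or reshape certain tensors (e.g. conv weights) between framework layouts; the key mapping is just the first step.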
@Blaizzy
Blaizzy / tokenizers_utils.py
Last active January 13, 2026 17:15
Decode Stream
import json
from functools import partial
from json import JSONDecodeError
from typing import List
from transformers import AutoTokenizer
import tokenizers
REPLACEMENT_CHAR = "\ufffd"
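U+FFFD is the usual signal that a decoded byte sequence ends mid-way through a multi-byte UTF-8 character, so a streaming detokenizer holds text back until the replacement character disappears. A sketch of that pattern against a generic `decode(ids) -> str` callable (the class is ours; pass e.g. `tokenizer.decode` in practice):

```python
REPLACEMENT_CHAR = "\ufffd"

class DecodeStream:
    """Incrementally turn token ids into text, deferring output while
    the tail of the decoded string is an incomplete character."""

    def __init__(self, decode):
        self.decode = decode  # callable: list[int] -> str
        self.ids = []
        self.offset = 0       # number of characters already emitted

    def step(self, token_id: int) -> str:
        self.ids.append(token_id)
        text = self.decode(self.ids)
        if text.endswith(REPLACEMENT_CHAR):
            return ""         # wait for the rest of the character
        new = text[self.offset:]
        self.offset = len(text)
        return new
```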