Benchmarks for mlx-lm
The command for evaluating on MMLU Pro:
mlx_lm.evaluate --model model/repo --task mmlu_pro
The command for efficiency benchmarks:
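One hedged option (this specific command is an assumption, not taken from the mlx-lm docs): mlx_lm.generate reports prompt and generation tokens-per-second plus peak memory, so a run like the following doubles as a simple efficiency benchmark; the prompt and token count are illustrative:
mlx_lm.generate --model model/repo --prompt "Summarize the history of jazz in three sentences." --max-tokens 256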
import ArgumentParser
import Foundation
import FoundationModels

@main
struct JazzCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        commandName: "jazz",
        abstract: "A CLI tool to interpret shell tasks as natural language instructions."
    )
    // Completion sketch (the original snippet is truncated here): the argument and run() are assumptions.
    @Argument(help: "The natural-language instruction to interpret.")
    var task: String
    func run() async throws {
        let session = LanguageModelSession()  // on-device Foundation Models session
        print(try await session.respond(to: task).content)
    }
}
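Assuming the built binary is on PATH, a hypothetical invocation would be:
jazz "move every screenshot on the Desktop into an Archive folder"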
At WWDC 25 Apple opened up the on-device large-language model that powers Apple Intelligence to every iOS, iPadOS, macOS and visionOS app via a new “Foundation Models” framework. The model is a compact ~3 billion-parameter LLM that has been quantized to just 2 bits per weight, so it runs fast, offline and entirely on the user’s device, keeping data private while still handling tasks such as summarization, extraction, short-form generation and structured reasoning. ([developer.apple.com][1], [machinelearning.apple.com][2]) Below is a developer-focused English-language overview—based mainly on Apple’s own announcements, docs and WWDC sessions—followed by ready-to-paste Swift code samples.
Apple ships two sibling LLMs: a device-scale model (~3B params) that runs on Apple silicon and a server-scale mixture-of-experts model that runs inside Private Cloud Compute when more heft is required. ([machinelearning.apple.com][2])
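As a hedged getting-started sketch (the helper name is hypothetical; SystemLanguageModel and LanguageModelSession are the framework's documented entry points), an app can confirm the on-device model is available before sending it a prompt:

import FoundationModels

// Sketch: confirm the on-device model is usable, then ask it for a short reply.
// Unavailability reasons include an unsupported device or Apple Intelligence being off.
func askOnDeviceModel(_ prompt: String) async throws {
    let model = SystemLanguageModel.default
    guard model.isAvailable else {
        print("On-device model unavailable: \(model.availability)")
        return
    }
    let session = LanguageModelSession()
    let response = try await session.respond(to: prompt)
    print(response.content)
}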
{
  "name": "NuPhy Air60 V2",
  "vendorProductId": 435499605,
  "macros": [
    "{+KC_LSFT}{+KC_LGUI} {-KC_LSFT}{-KC_LGUI}",
    "{KC_LGUI} ",
    "{+KC_LSFT} ",
    "",
    "",
    "",
import { readdirSync, readFileSync } from "fs";
import { join } from "path";

function listContentsAndAnalyzeDurations(directoryPath: string) {
  const filesAndFolders = readdirSync(directoryPath, { withFileTypes: true });
  let totalRecordings = 0;
  let durations_milliseconds: number[] = [];
  let totalCharacters = 0;
  let totalWords = 0;
Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggml-org/llama.cpp#5962
In the meantime, use the largest quantization that fully fits in your GPU's memory. If you can comfortably fit Q4_K_S, try a model with more parameters instead.
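As a rough, hedged illustration of "largest that fits": Q4_K_S stores roughly 4.5 bits per weight, so a 13B model's weights come to about 13e9 × 4.5 / 8 ≈ 7.3 GB and fit on a 12 GB GPU with room left for context, while Q8_0 at roughly 8.5 bits per weight (about 13.8 GB) would not.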
See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it
make clean
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
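A hedged next step, assuming the pre-GGUF ./main binary that this build produces and that the model file sits in the repo root (prompt and token count are illustrative):
./main -m $MODEL -p "Explain GGML quantization in one paragraph." -n 256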
See nomic-ai/gpt4all for the canonical source.
~/GPT4All. Adjust the following commands as necessary for your own environment. Run conda env create -f conda-macos-arm64.yaml and then use the environment with conda activate gpt4all.
<svg class="gtd-wd" width="800" height="620" xmlns="http://www.w3.org/2000/svg"><style>
.gtd-wd {
  background-color: var(--background-primary, #202020);
}
.gtd-wd :is(line, rect, path) {
  stroke: var(--text-normal, #dcddde);
}
.gtd-wd :is(text, .fill) {
  fill: var(--text-normal, #dcddde);
}