Benchmarks for mlx-lm
The command for evaluating on MMLU Pro:
mlx_lm.evaluate --model model/repo --task mmlu_pro
The command for efficiency benchmarks:
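One hedged option (this specific command is an assumption, not taken from the mlx-lm docs): mlx_lm.generate reports prompt and generation tokens-per-second plus peak memory, so a run like the following doubles as a simple efficiency benchmark; the prompt and token count are illustrative:
mlx_lm.generate --model model/repo --prompt "Summarize the history of jazz in three sentences." --max-tokens 256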
import ArgumentParser
import Foundation
import FoundationModels

@main
struct JazzCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        commandName: "jazz",
        abstract: "A CLI tool to interpret shell tasks as natural language instructions."
    )
    // Completion sketch (the original snippet is truncated here): the argument and run() are assumptions.
    @Argument(help: "The natural-language instruction to interpret.")
    var task: String
    func run() async throws {
        let session = LanguageModelSession()  // on-device Foundation Models session
        print(try await session.respond(to: task).content)
    }
}
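Assuming the built binary is on PATH, a hypothetical invocation would be:
jazz "move every screenshot on the Desktop into an Archive folder"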
At WWDC 25 Apple opened up the on-device large-language model that powers Apple Intelligence to every iOS, iPadOS, macOS and visionOS app via a new “Foundation Models” framework. The model is a compact ~3 billion-parameter LLM that has been quantized to just 2 bits per weight, so it runs fast, offline and entirely on the user’s device, keeping data private while still handling tasks such as summarization, extraction, short-form generation and structured reasoning. ([developer.apple.com][1], [machinelearning.apple.com][2]) Below is a developer-focused English-language overview—based mainly on Apple’s own announcements, docs and WWDC sessions—followed by ready-to-paste Swift code samples.
Apple ships two sibling LLMs: a device-scale model (~3B params) that runs on Apple silicon and a server-scale mixture-of-experts model that runs inside Private Cloud Compute when more heft is required. ([machinelearning.apple.com][2])
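As a hedged getting-started sketch (the helper name is hypothetical; SystemLanguageModel and LanguageModelSession are the framework's documented entry points), an app can confirm the on-device model is available before sending it a prompt:

import FoundationModels

// Sketch: confirm the on-device model is usable, then ask it for a short reply.
// Unavailability reasons include an unsupported device or Apple Intelligence being off.
func askOnDeviceModel(_ prompt: String) async throws {
    let model = SystemLanguageModel.default
    guard model.isAvailable else {
        print("On-device model unavailable: \(model.availability)")
        return
    }
    let session = LanguageModelSession()
    let response = try await session.respond(to: prompt)
    print(response.content)
}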
{
  "name": "NuPhy Air60 V2",
  "vendorProductId": 435499605,
  "macros": [
    "{+KC_LSFT}{+KC_LGUI} {-KC_LSFT}{-KC_LGUI}",
    "{KC_LGUI} ",
    "{+KC_LSFT} ",
    "",
    "",
    "",
import { readdirSync, readFileSync } from "fs";
import { join } from "path";

function listContentsAndAnalyzeDurations(directoryPath: string) {
  const filesAndFolders = readdirSync(directoryPath, { withFileTypes: true });
  let totalRecordings = 0;
  let durations_milliseconds: number[] = [];
  let totalCharacters = 0;
  let totalWords = 0;
Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggml-org/llama.cpp#5962
In the meantime, use the largest quantization that fully fits in your GPU's memory. If you can comfortably fit Q4_K_S, try a model with more parameters instead.
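As a rough, hedged illustration of "largest that fits": Q4_K_S stores roughly 4.5 bits per weight, so a 13B model's weights come to about 13e9 × 4.5 / 8 ≈ 7.3 GB and fit on a 12 GB GPU with room left for context, while Q8_0 at roughly 8.5 bits per weight (about 13.8 GB) would not.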
See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it
make clean
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
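A hedged next step, assuming the pre-GGUF ./main binary that this build produces and that the model file sits in the repo root (prompt and token count are illustrative):
./main -m $MODEL -p "Explain GGML quantization in one paragraph." -n 256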
See nomic-ai/gpt4all for the canonical source.
~/GPT4All. Adjust the following commands as necessary for your own environment. Run conda env create -f conda-macos-arm64.yaml and then use the environment with conda activate gpt4all.
<svg class="gtd-wd" width="800" height="620" xmlns="http://www.w3.org/2000/svg"><style>
.gtd-wd {
  background-color: var(--background-primary, #202020);
}
.gtd-wd :is(line, rect, path) {
  stroke: var(--text-normal, #dcddde);
}
.gtd-wd :is(text, .fill) {
  fill: var(--text-normal, #dcddde);
}