```
git clone git@github.com:filipstrand/mflux.git
cd mflux && pip install -r requirements.txt
```

Name this anything, maybe `flux.py`. Make sure to update the two paths marked below.
```python
import numpy as np
import mlx.core as mx
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
import tqdm

def conway(a: mx.array):
    # Count live neighbors: a 3x3 conv sums each cell and its 8 neighbors,
    # so subtract the cell itself afterwards.
    kernel = mx.ones((1, 3, 3, 1))
    neighbors = mx.conv2d(a[None, :, :, None], kernel, padding=1)[0, :, :, 0] - a
    # A cell is alive next step with exactly 3 neighbors, or 2 if already alive.
    return mx.logical_or(neighbors == 3, mx.logical_and(a == 1, neighbors == 2)).astype(a.dtype)
```
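A driver loop isn't shown above; here is a minimal sketch of one (the grid size, fill density, and frame count are arbitrary choices, not from the original):

```python
# Sketch of a driver (assumed, not the article's code): animate a random grid.
grid = (mx.random.uniform(shape=(128, 128)) < 0.2).astype(mx.float32)

fig, ax = plt.subplots()
im = ax.imshow(np.array(grid), cmap="binary")

def step(_):
    global grid
    grid = conway(grid)          # one Game-of-Life update in MLX
    im.set_data(np.array(grid))  # copy back to NumPy for matplotlib
    return (im,)

anim = FuncAnimation(fig, step, frames=tqdm.tqdm(range(200)), interval=50)
plt.show()
```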
| source = """ |
| import os | |
| import mlx.core as mx | |
| from mlx_lm import load, generate | |
| filename = os.path.join(os.path.dirname(mx.__file__), "core/__init__.pyi") | |
| with open(filename, 'r') as fid: | |
| prompt = fid.read() | |
| prompt += "\nHow do you write a self-attention layer using the above API in MLX?" | |
| model, tokenizer = load("mlx-community/meta-Llama-3.1-8B-Instruct-4bit") |
| """ | |
A minimal, fast example generating text with Llama 3.1 in MLX.

To run, install the requirements:

```
pip install -U mlx transformers fire
```

Then generate text with:

```
python l3min.py "How tall is K2?"
```
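The `fire` dependency suggests the script exposes its generation function as a command-line interface. A minimal sketch of that wiring (the `generate_text` name and body are assumptions for illustration, not the actual `l3min.py` code):

```python
# Hypothetical CLI wiring for a script like l3min.py; the real
# model-loading and generation logic is not shown here.
import fire

def generate_text(prompt: str, max_tokens: int = 128):
    # ... load the model and generate a completion for `prompt` ...
    print(f"(would generate up to {max_tokens} tokens for: {prompt!r})")

if __name__ == "__main__":
    # fire turns the function signature into a CLI:
    #   python l3min.py "How tall is K2?" --max_tokens 64
    fire.Fire(generate_text)
```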
```python
# Requires:
#   pip install pyobjc-framework-Metal
import numpy as np
import Metal

# Get the default GPU device
device = Metal.MTLCreateSystemDefaultDevice()

# Make a command queue to encode command buffers to
command_queue = device.newCommandQueue()
```
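From here, the usual Metal flow is to create command buffers on the queue, encode work into them, and commit them to the GPU. A small sketch of that lifecycle (no kernel is encoded here; this continuation is not part of the original snippet):

```python
# Create a command buffer from the queue, commit it, and block until the
# GPU has processed it -- an empty round trip, just to show the lifecycle.
command_buffer = command_queue.commandBuffer()
command_buffer.commit()
command_buffer.waitUntilCompleted()
```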
Recall that MLX is lazy: no actual computation happens until you explicitly or implicitly evaluate the graph. Even loading arrays from a file is lazy:

```python
weights = mx.load("model.safetensors")
```
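To make the laziness concrete, here is a small illustration (an added example, not from the original text) of when work actually happens:

```python
import mlx.core as mx

a = mx.random.normal((1024, 1024))
b = a @ a      # builds the graph; no matmul has run yet
mx.eval(b)     # explicit evaluation: the computation happens here
print(b[0, 0]) # printing also forces evaluation, implicitly
```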
```python
from typing import Callable, Tuple
import operator
from functools import reduce
from itertools import product

import mlx.core as mx

def _interpolate(
    x: mx.array, scale_factor: Tuple, indices_fn: Callable, align_corners: bool = False
):
    ...  # body elided in this excerpt
```
This is a short article on a common type of not-yet-supported operation in MLX: ops where the output shape depends on the input data. Below is a quick illustration of what that means, followed by an outline:
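A small NumPy sketch (an added example) of outputs whose shapes depend on values rather than input shapes:

```python
import numpy as np

x = np.array([0, 3, 0, 5])

# The sizes of these results depend on the values in x, not just its
# shape -- you can't know them without running the computation:
print(np.nonzero(x)[0])  # [1 3]    -> 2 elements this time
print(np.unique(x))      # [0 3 5]  -> 3 elements this time
print(x[x > 0])          # [3 5]    -> boolean masking
```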