Demonstrates that the Apple Neural Engine (ANE) achieves significantly higher throughput with INT8 W8A8 quantization (8-bit weights and 8-bit activations) than with FP16, consistent with native INT8 datapath support.
| Method | FP16 | INT8 W8A8 | Ratio |
|---|---|---|---|

    import FoundationModels
    import Playgrounds
    import Foundation

    // Time a single request from prompt submission to the complete response.
    let session = LanguageModelSession()
    let start = Date()
    let response = try await session.respond(to: "What is Apple Neural Engine and how to use it?")
    let responseText = response.content  // `content` holds the generated String
    let end = Date()
    print(responseText)
    print("Elapsed: \(end.timeIntervalSince(start)) s")
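For readers unfamiliar with the term, W8A8 means that both weights and activations are stored as 8-bit integers. The following is a minimal CPU-side sketch of symmetric per-tensor quantization with 32-bit integer accumulation. It only illustrates the arithmetic; it is not the benchmark code and does not run on the ANE. The names `QuantizedTensor`, `quantize`, and `dotW8A8` are hypothetical and introduced here for illustration.

```swift
import Foundation

/// An INT8 tensor plus the single scale that maps it back to floating point.
struct QuantizedTensor {
    let values: [Int8]
    let scale: Float
}

/// Symmetric per-tensor quantization: q = round(x / scale), clamped to [-127, 127].
func quantize(_ x: [Float]) -> QuantizedTensor {
    // The largest magnitude maps to 127; an all-zero tensor falls back to scale 1.
    let maxAbs = x.map { abs($0) }.max() ?? 0
    let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
    let values = x.map { v -> Int8 in
        let q = (v / scale).rounded()
        return Int8(max(-127, min(127, q)))
    }
    return QuantizedTensor(values: values, scale: scale)
}

/// W8A8 dot product: INT8 weights times INT8 activations, accumulated in Int32,
/// then rescaled to Float once at the end.
func dotW8A8(_ w: QuantizedTensor, _ a: QuantizedTensor) -> Float {
    precondition(w.values.count == a.values.count, "length mismatch")
    var acc: Int32 = 0
    for (wi, ai) in zip(w.values, a.values) {
        acc += Int32(wi) * Int32(ai)
    }
    return Float(acc) * w.scale * a.scale
}

// Illustrative inputs only; these are not measured data.
let weights: [Float] = [0.5, -1.0, 0.25, 0.75]
let activations: [Float] = [2.0, 1.0, -4.0, 0.5]

let reference = zip(weights, activations).map { $0 * $1 }.reduce(0, +)
let approx = dotW8A8(quantize(weights), quantize(activations))
print("Float reference: \(reference), W8A8 result: \(approx)")
```

The inner loop touches only integers, which is the property a native INT8 datapath can exploit. The two floating-point scales are applied once per output rather than once per multiply.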
This gist shows a working local Pi provider setup for Apple's fm serve
Chat Completions endpoint.
It supports both Apple Foundation Models exposed by the fm CLI:
- fm/system: on-device Apple Foundation Model, configured with a 4K context
- fm/pcc: Private Cloud Compute model, configured with a 32K context
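As a rough illustration of what a client sends to a Chat Completions style endpoint, the sketch below builds (but does not send) a request in Swift. It assumes the server accepts the widely used JSON shape of a `model` string plus a list of `role`/`content` messages. The URL, port, path, and the model string `"system"` are placeholders and are not taken from the gist; substitute whatever `fm serve` actually reports.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

// Minimal Chat Completions request shape: a model name and a list of messages.
struct ChatMessage: Codable {
    let role: String
    let content: String
}

struct ChatRequest: Codable {
    let model: String
    let messages: [ChatMessage]
}

// ASSUMPTION: placeholder address, not taken from the gist.
let endpoint = URL(string: "http://localhost:8080/v1/chat/completions")!

/// Builds a POST request carrying a single user message as JSON.
func makeRequest(model: String, prompt: String) throws -> URLRequest {
    var request = URLRequest(url: endpoint)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body = ChatRequest(model: model,
                           messages: [ChatMessage(role: "user", content: prompt)])
    request.httpBody = try JSONEncoder().encode(body)
    return request
}

// ASSUMPTION: "system" stands in for the on-device model's server-side name.
let request = try makeRequest(model: "system", prompt: "Hello")
print(String(data: request.httpBody!, encoding: .utf8)!)
```

Passing the resulting `URLRequest` to `URLSession.shared.data(for:)` would perform the call. The sketch stops short of that so it runs without a server.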