---
title: What the Apple Neural Engine and Google's TPU tell us about the next decade of inference
description: A deep-research report on neural processors. How they differ from GPUs, and what Apple and Google have built across the ANE and TPU lineage.
doc_version: "1.0"
last_updated: 2026-06-11
slug: ane-tpu-deep-research
kind: essay
status: active
date: 2026-06-08
summary: A deep-research report on neural processors. What they are, how they differ from GPUs, and what Apple and Google have built across the Apple Neural Engine (ANE) and the TPU lineage.
tags:
  - apple-neural-engine
  - tpu
  - on-device-ai
  - inference
  - core-ml
  - foundation-models
  - deep-research
  - npu
---

<div class="repost-note">
<p class="repost-note-label">A note from Jesse</p>

I have been excited about NPUs since [Alex McNamara](https://www.linkedin.com/in/alexmcnamara/) first proposed using them for our on-device AI work at Orion Labs in 2019. I am a big believer in moving intelligence as close to the edge as possible. Many people are surprised to learn that most iPhones contain dedicated, supercomputer-class processors rated at trillions of operations per second.

I have returned to this hands-on recently in my contributions to [maclocal-api](https://github.com/scouzi1966/maclocal-api), an open-source project that puts Apple's on-device models behind an OpenAI-compatible HTTP API. Over the last several weeks I have added support for Apple's on-device [speech transcription and text-to-speech](https://github.com/scouzi1966/maclocal-api/pull/113), an [embeddings endpoint backed by Apple's NaturalLanguage framework](https://github.com/scouzi1966/maclocal-api/pull/119), and a set of [Vision endpoints for OCR, barcodes, image classification, and saliency](https://github.com/scouzi1966/maclocal-api/pull/114).

When I talk to people about the research and commits I have made, I keep having to explain the Apple Neural Engine (ANE) and TPUs each time. I realized I should just run a Claude deep-research pass and share the output.

</div>

## What a neural processor actually is

<figure style="margin:1.5rem 0 2rem;background:#ffffff;border:1px solid var(--border-main);border-radius:8px;padding:1.25rem;color:#1f2937">
<img src="/research/cpu-gpu-npu-comparison.svg" alt="Illustrative side-by-side comparison of how a CPU, a GPU, and an NPU allocate their on-chip area, and what each is best at. The CPU dedicates roughly half its area to control and scheduling logic, with caches and a small number of wide ALUs, and is best at branchy logic, control flow, and the operating system. The GPU dedicates most of its area to thousands of small parallel cores and is best at graphics, simulation, and large-batch model training. The NPU dedicates the majority of its area to a dense multiply-accumulate array and on-chip SRAM, and is best at low-power, always-on, on-device neural network inference." loading="lazy" style="width:100%;height:auto;display:block" />
<figcaption style="font-size:0.85em;opacity:0.7;margin-top:0.75rem;text-align:center;color:#4b5563">How a CPU, a GPU, and an NPU spend their silicon, and what each is best at. Schematic and illustrative; real chips vary widely. The point is the design intent.</figcaption>
</figure>

A neural processor (commonly called an NPU, for neural processing unit) is a piece of silicon designed for one specific job: running the math at the heart of a neural network. That math is overwhelmingly multiply-accumulate operations on small matrices and tensors of fixed-point or low-precision floating-point numbers. An NPU is built around that one job and very little else.

<figure style="margin:1.5rem 0 2rem;background:#ffffff;border:1px solid var(--border-main);border-radius:8px;padding:1.25rem;color:#1f2937">
<img src="/research/tensor-math.svg" alt="A visual explanation of the matrix multiplication operation that dominates neural network inference. One row of input matrix A is multiplied element-wise with one column of input matrix B, and the products are summed to produce a single cell of the output matrix C. The example shows the row 2, 3, 1, 4 multiplied by the column 5, 2, 7, 1, producing 27. A neural network does billions of these operations per inference pass; an NPU is silicon built to do them in parallel with as little overhead as possible." loading="lazy" style="width:100%;height:auto;display:block" />
<figcaption style="font-size:0.85em;opacity:0.7;margin-top:0.75rem;text-align:center;color:#4b5563">What the math actually looks like. One row of A, one column of B, multiplied element-wise and summed into one cell of C. Each multiply-and-add in that sum is one multiply-accumulate. An NPU runs thousands of these in parallel.</figcaption>
</figure>
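The row-times-column step in the figure can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, counting each multiply-and-add as one multiply-accumulate (MAC); the function name `matmul_with_mac_count` is made up for this example, and real NPUs do this work in parallel hardware rather than nested loops.

```python
def matmul_with_mac_count(a, b):
    """Return (C, mac_count) for C = A @ B, using nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of A and B must match")
    c = [[0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
            c[i][j] = acc
    return c, macs


# The row and column from the figure: one output cell.
row = [[2, 3, 1, 4]]
col = [[5], [2], [7], [1]]
cell, macs = matmul_with_mac_count(row, col)
print(cell, macs)  # [[27]] 4
```

The MAC count grows as rows × columns × inner dimension, which is why layers with large matrices dominate the cost of an inference pass.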

The clearest way to see what an NPU is: compare the three kinds of processor in a modern system side by side.

**CPU.** A general-purpose processor. It runs arbitrary control flow, branchy logic, the operating system, and everything else. It is the most flexible compute in the system and the least efficient for tensor math, because almost none of its transistors are dedicated to wide parallel multiplies.

**GPU.** A massively parallel processor originally built for graphics, retrofitted very successfully for machine learning. GPUs are good at the same wide parallel arithmetic that neural networks need, which is why they powered the deep-learning era. They are programmable through general-purpose compute APIs and shading languages (Metal, CUDA, Vulkan compute) and can run almost anything that maps to thousands of parallel threads. Training large models on GPUs at scale is what built modern AI.

**NPU.** A fixed-function accelerator built for one shape of computation: dense low-precision matrix multiplies and convolutions, typically with built-in support for the activations, normalizations, and quantization steps that neural networks need. NPUs are not designed to be fully programmable. They are designed to do the inner loop of an inference pass at very low power and very high throughput.

**The trade-off** is straightforward. A GPU can run a neural network and a fluid simulation and a fragment shader. An NPU can run a neural network. In exchange for giving up that generality, an NPU spends nearly all of its transistor budget on the multiplier arrays, the on-chip SRAM that feeds them, and the data-movement plumbing between the two. For the workloads it is built for, the result is more operations per second per watt than a GPU of comparable area.

That last phrase, *per watt*, is where NPUs matter most. A datacenter GPU can draw 400 to 700 watts; a phone or a watch cannot. If you want a model to run continuously on a battery-powered device, listening for a wake word or watching a camera feed, the power envelope is measured in milliwatts, not watts. That budget is what makes a fixed-function accelerator the right tool: you give up programmability to spend the silicon on doing the one job efficiently.

Both of the chips this report is about, the Apple Neural Engine and Google's TPU, are NPUs in this sense. Both are built around large arrays of multiply-accumulate units. Both are paired with software runtimes that handle the data movement, quantization, and model compilation that turn a high-level model description into something the silicon can execute. They sit at different points on the size and power curve, one designed for the inside of a phone, the other for the inside of a datacenter rack, but the underlying design philosophy is the same: dedicate the silicon to the math the network actually does.

For a longer, more visual treatment of how systolic arrays and dedicated tensor silicon work, the videos in the "Watch" section below are good starting points.

## A. What the Apple Neural Engine actually is

The ANE has shipped in every Apple-designed system-on-chip since the A11 Bionic in September 2017, the chip Apple introduced alongside the iPhone X. At launch Apple described the original ANE as a dual-core block capable of 600 billion operations per second, used initially to power Face ID, Animoji, and real-time image processing ([CNBC, Sep 12 2017](https://www.cnbc.com/2017/09/12/apple-unveils-a11-bionic-neural-engine-ai-chip-in-iphone-x.html)).

It is a fixed-function NPU, not a general-purpose accelerator. It is optimized for FP16 convolutional and matrix-multiplication workloads, the operations that dominate vision models and modern transformer inference.

**Throughput growth, 2017 to 2021.** Apple's published peak figures show a roughly 26-fold jump in four years: the A11's 0.6 TFLOPS gave way to a 16-core ANE in the A15 Bionic (2021) that Apple rates at 15.8 TFLOPS. Treat these as vendor peak numbers against vendor reference benchmarks; they are accurate as published but should be framed that way.
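As a quick check on the arithmetic, the fold increase and the implied compound yearly growth can be worked out from the two published peak figures. This is a sketch under the assumption that growth is smoothed evenly over the four years; the variable names are illustrative.

```python
# Vendor-published peak figures quoted above.
a11_tflops = 0.6    # A11 Bionic, 2017
a15_tflops = 15.8   # A15 Bionic, 2021
years = 2021 - 2017

fold = a15_tflops / a11_tflops      # overall increase
per_year = fold ** (1 / years)      # compound annual growth factor

print(f"{fold:.1f}x overall, about {per_year:.2f}x per year")
```

On these numbers the increase works out to a little over 26×, or somewhat more than a doubling each year on average.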

**The DistilBERT case study.** In 2022, Apple's machine learning research team published a reference implementation showing how to adapt a Hugging Face transformer model to run efficiently on the ANE. With the optimizations applied, Apple reported the model running up to 10× faster and using 14× less peak memory, with an end-to-end latency of 3.47 ms at 0.454 W on an iPhone 13 (sequence length 128, batch size 1) ([Deploying Transformers on the Apple Neural Engine, Apple ML Research, 2022](https://machinelearning.apple.com/research/neural-engine-transformers); reference code: [apple/ml-ane-transformers](https://github.com/apple/ml-ane-transformers); optimized model weights: [apple/ane-distilbert-base-uncased-finetuned-sst-2-english on Hugging Face](https://huggingface.co/apple/ane-distilbert-base-uncased-finetuned-sst-2-english)).
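The latency and power figures can be combined into an energy-per-inference estimate, which is the quantity that matters on a battery. The sketch below assumes the reported 0.454 W is the average draw over the 3.47 ms window; that is an assumption made for this example, not something the source states in those terms.

```python
latency_s = 3.47e-3   # reported end-to-end latency, in seconds
power_w = 0.454       # reported power, assumed here to be the average draw

energy_j = power_w * latency_s          # joules per inference
inferences_per_joule = 1 / energy_j

print(f"{energy_j * 1e3:.2f} mJ per inference")
print(f"about {inferences_per_joule:.0f} inferences per joule")
```

Under that assumption a single pass costs on the order of a millijoule or two, which is why always-on inference on a phone battery is practical for a model of this size.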

**The software stack predates the LLM wave.** Most of the framing around on-device AI treats the current moment as new. Apple's developer surface for it is not. The relevant milestones, in order:

- **Core ML (2017).** Apple's high-level inference framework, the public path to running ML models on Apple Silicon. Documentation at [developer.apple.com/documentation/coreml](https://developer.apple.com/documentation/coreml).
- **Natural Language framework (iOS 12, WWDC 2018).** On-device language identification, tokenization, lemmatization, part-of-speech tagging, and named-entity recognition ([Introducing Natural Language Framework, WWDC 2018, Session 713](https://nonstrict.eu/wwdcindex/wwdc2018/713/)).
- **WWDC 2019 additions.** On-device sentiment analysis across seven languages and word embeddings, used together for in-app search and similarity ([Advances in Natural Language Framework, WWDC 2019, Session 232](https://developer.apple.com/videos/play/wwdc2019/232/); session transcript at [ASCIIwwdc](https://asciiwwdc.com/2019/sessions/232)). Apple's documentation describes the sentiment classifier as "hardware activated" on supported devices; that is Apple's phrase and it does not explicitly name the ANE, so it should not be read as a guaranteed-ANE claim.
- **MLX (2023).** Apple's array framework for machine learning on Apple Silicon, with a unified memory model and a NumPy-like Python API. This is the path for training and fine-tuning workflows on Apple hardware, distinct from Core ML's inference focus ([ml-explore/mlx](https://github.com/ml-explore/mlx)).
- **Foundation Models framework (iOS 26, WWDC 2025).** A Swift API for tapping Apple Intelligence's on-device ~3B-parameter model in roughly three lines of code, with guided generation, streaming, and tool calls ([Foundation Models, Apple Developer Documentation](https://developer.apple.com/documentation/FoundationModels); [Meet the Foundation Models framework, WWDC 2025, Session 286](https://developer.apple.com/videos/play/wwdc2025/286/); [Deep dive into the Foundation Models framework, WWDC 2025, Session 301](https://developer.apple.com/videos/play/wwdc2025/301/)).

<figure style="margin:1.5rem 0">
<iframe src="https://www.youtube-nocookie.com/embed/mJMvFyBvZEk" title="WWDC25: Meet the Foundation Models framework — Apple" loading="lazy" allowfullscreen style="width:100%;aspect-ratio:16/9;border:0;border-radius:8px;display:block"></iframe>
<figcaption style="font-size:0.85em;opacity:0.7;margin-top:0.5rem">Apple's introduction to the Foundation Models framework in iOS 26 (WWDC 2025, Session 286). The three-line "hello world" for on-device LLM use lives at around the four-minute mark. Mirror on the <a href="https://www.youtube.com/watch?v=mJMvFyBvZEk">official Apple YouTube channel</a>.</figcaption>
</figure>

Nine years of public, mature on-device ML infrastructure. The headlines moved on; the stack kept growing.

<figure style="margin:1.5rem 0">
<iframe src="https://www.youtube-nocookie.com/embed/p_hyo2FRil4" title="WWDC24: Explore machine learning on Apple platforms — Apple" loading="lazy" allowfullscreen style="width:100%;aspect-ratio:16/9;border:0;border-radius:8px;display:block"></iframe>
<figcaption style="font-size:0.85em;opacity:0.7;margin-top:0.5rem">For a current end-to-end orientation across Core ML, MLX, Create ML, and the Vision, Natural Language, and Speech frameworks, Apple's WWDC 2024 overview is the single best place to start. Mirror on the <a href="https://www.youtube.com/watch?v=p_hyo2FRil4">official Apple YouTube channel</a>.</figcaption>
</figure>

## B. How Apple has chosen to expose it

Apple's developer surface for the ANE is shaped by a clear design choice: the operating system, not the application, decides where neural work runs. The result is a runtime that lets app developers ship machine learning features without having to reason about which compute unit to target.

**Core ML is the public path, and it is a scheduler.** At runtime, Core ML inspects a model and decides, layer by layer, whether each one runs best on the CPU, the GPU, or the ANE. The developer ships the model; Apple's runtime handles the placement. As Apple silicon evolves, the same model gets faster without the app having to ship an update, because the scheduler knows what the new hardware can do.

The configuration surface reflects that philosophy. `MLComputeUnits` exposes four preferences: `all`, `cpuOnly`, `cpuAndGPU`, and `cpuAndNeuralEngine` ([MLComputeUnits, Apple Developer Documentation](https://developer.apple.com/documentation/coreml/mlcomputeunits)). The defaults are tuned for the common case. The narrower options exist for the cases where a developer has a reason to constrain the runtime, for example to keep the GPU free for graphics work or to validate behavior under a specific compute path ([MLComputeUnits.cpuAndNeuralEngine, Apple Developer Documentation](https://developer.apple.com/documentation/coreml/mlcomputeunits/cpuandneuralengine)).

The most concrete public illustration is Apple's own WWDC 2022 session, "Optimize your Core ML usage." Presenters run a YOLOv3 object-detection model under the `all` setting and use Xcode's performance report to inspect, layer by layer, where Core ML placed the work. The result for that model: 54 layers on the GPU and 32 on the ANE. The developer set a single preference; the runtime made 86 informed placement decisions on their behalf, including the ones that fit the ANE's matrix-multiply units best ([Optimize your Core ML usage, WWDC 2022, Session 10027](https://developer.apple.com/videos/play/wwdc2022/10027/)).

<figure style="margin:1.5rem 0">
<iframe src="https://www.youtube-nocookie.com/embed/THXq071qZ6E" title="WWDC22: Optimize your Core ML usage — Apple" loading="lazy" allowfullscreen style="width:100%;aspect-ratio:16/9;border:0;border-radius:8px;display:block"></iframe>
<figcaption style="font-size:0.85em;opacity:0.7;margin-top:0.5rem">Apple's "Optimize your Core ML usage" from WWDC 2022 (Session 10027). The YOLOv3 layer-distribution walkthrough is the clearest existing demonstration of how Core ML decides where each layer of a model runs. Mirror on the <a href="https://www.youtube.com/watch?v=THXq071qZ6E">official Apple YouTube channel</a>.</figcaption>
</figure>
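To make the "one preference, many placement decisions" idea concrete, here is a toy placement routine. It is a hypothetical sketch, not Core ML's algorithm: the enum names mirror the four `MLComputeUnits` preferences, but the cost table, the op names, and the `place` function are invented for illustration, and the sketch ignores the cost of moving data between compute units.

```python
from collections import Counter
from enum import Enum


class ComputeUnits(Enum):
    # Names mirror the four MLComputeUnits preferences; values are hypothetical.
    ALL = ("cpu", "gpu", "ane")
    CPU_ONLY = ("cpu",)
    CPU_AND_GPU = ("cpu", "gpu")
    CPU_AND_NEURAL_ENGINE = ("cpu", "ane")


# Hypothetical per-op cost table (lower is better); None means unsupported.
# Real costs depend on hardware, layer shapes, and OS version.
COST = {
    "conv":     {"cpu": 9.0, "gpu": 3.0, "ane": 1.0},
    "matmul":   {"cpu": 8.0, "gpu": 2.0, "ane": 1.5},
    "upsample": {"cpu": 4.0, "gpu": 1.0, "ane": None},
    "custom":   {"cpu": 5.0, "gpu": None, "ane": None},
}


def place(layers, preference):
    """Assign each layer to the cheapest allowed unit that supports it."""
    plan = []
    for op in layers:
        options = {
            unit: COST[op][unit]
            for unit in preference.value
            if COST[op][unit] is not None
        }
        plan.append(min(options, key=options.get))
    return plan


layers = ["conv", "conv", "upsample", "matmul", "custom"]
for pref in ComputeUnits:
    print(pref.name, dict(Counter(place(layers, pref))))
```

The CPU supports every op in this toy table, so it acts as the fallback, and narrowing the preference only removes candidates. That is the shape of the trade the narrower options offer: predictability in exchange for placement freedom.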

**Training has its own path.** Core ML and the ANE focus on inference. For training and fine-tuning on Apple Silicon, Apple ships MLX, a NumPy-style array framework with a unified memory model that targets the CPU and GPU directly, and PyTorch on Apple Silicon runs through Metal Performance Shaders. The split is deliberate: the ANE is built for the inference inner loop, and the GPU is the natural home for backprop. Apple's WWDC 2025 sessions on MLX go deep on how to use it for LLM fine-tuning on consumer Macs ([Get started with MLX for Apple silicon, WWDC 2025, Session 315](https://developer.apple.com/videos/play/wwdc2025/315/); [Explore large language models on Apple silicon with MLX, Session 298](https://developer.apple.com/videos/play/wwdc2025/298/)).

**A growing community is exploring the ANE in the open.** While Apple's published material on the ANE itself is limited and the official channels for working with it are intentionally high-level, independent researchers, and in one case Apple's own machine learning team, have been building tools, documentation, benchmarks, and reference implementations around the chip in public. The most useful entry points to that work:

- **[hollance/neural-engine](https://github.com/hollance/neural-engine)** — Matthijs Hollemans's comprehensive community documentation of ANE behavior, performance characteristics, and supported operations. The single best existing resource on the ANE.
- **[mdaiter/ane](https://github.com/mdaiter/ane)** — early reverse engineering with working Python and Objective-C samples, documenting the ANECompiler framework and IOKit dispatch.
- **[eiln/ane](https://github.com/eiln/ane)** — a reverse-engineered Linux driver for the ANE from the Asahi Linux project, providing insight into the kernel-level interface.
- **[apple/ml-ane-transformers](https://github.com/apple/ml-ane-transformers)** — Apple's own reference implementation of transformers optimized for the ANE, confirming design patterns like channel-first layout and a preference for 1×1 convolutions.
- **[Anemll/Anemll](https://github.com/Anemll/Anemll)** (pronounced "animal," for "Artificial Neural Engine Machine Learning Library") — an open-source project focused on running large language models directly on the ANE, with a single-file conversion-and-inference pipeline for LLaMA, Qwen, Qwen 2.5, and Gemma 3 architectures, plus a companion benchmarking tool at [Anemll/anemll-bench](https://github.com/Anemll/anemll-bench).
- **[maderix/ANE](https://github.com/maderix/ANE)** — research into training on the M4 ANE, building a custom compute graph with a backward pass by talking to the lower-level frameworks inside `AppleNeuralEngine.framework`. The author is careful about the caveats: the proof of concept runs at roughly 5–9% of peak ANE utilization, and the methodology depends on undocumented APIs ([Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine)).

The natural reading is that Apple is choosing the rate at which to open up programmability, and the hardware will let them go further when they decide to. In the meantime the community has done a lot of careful work to map what is already there.

The overall shape: Apple has built a mature on-device ML stack where the runtime handles the hard scheduling problem so applications do not have to. That choice has shipped a lot of working machine learning into a lot of pockets.

## C. The TPU lineage and why it matters

Step back from Apple for a moment and the broader picture is that dedicated tensor silicon has been in production at industrial scale for over a decade, and the architectural ideas have been migrating from datacenters to the edge.

**The first-generation TPU.** Google's first TPU went into production inside Google datacenters in 2015. Norman Jouppi and colleagues described it in detail at ISCA 2017 in a paper that has become the standard reference: a custom ASIC built around a 256×256 systolic-array matrix-multiply unit, with 65,536 8-bit MACs, a peak throughput of 92 TeraOps/second, and 28 MiB of software-managed on-chip memory ([In-Datacenter Performance Analysis of a Tensor Processing Unit, Jouppi et al., ISCA 2017, arXiv:1704.04760](https://arxiv.org/abs/1704.04760); [ACM proceedings entry](https://dl.acm.org/doi/10.1145/3079856.3080246)).
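The headline throughput follows directly from the array size. The sketch below reproduces it, using the 700 MHz clock rate reported in the same paper (a figure not quoted above) and the usual convention that one multiply-accumulate counts as two operations.

```python
macs = 256 * 256      # MAC units in the systolic array
ops_per_mac = 2       # one multiply plus one add per cycle
clock_hz = 700e6      # 700 MHz clock, as reported in the Jouppi et al. paper

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"{macs} MACs, {peak_tops:.2f} TeraOps/s peak")
```

This lands just under the rounded 92 TeraOps/second figure. It is a peak number: it assumes every MAC unit is busy on every cycle.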

**Performance, with caveats.** The same paper reports the first-generation TPU as 15× to 30× faster than a contemporary Intel Haswell CPU and a contemporary NVIDIA K80 GPU on Google's production inference workloads, with 30× to 80× better performance-per-watt. These are Google-published numbers against Google's reference workloads; they were peer-reviewed and have aged well, but they are vendor benchmarks and should be framed that way. For a retrospective tour across four TPU generations from one of the paper's co-authors, see David Patterson's [*A Decade of Machine Learning Accelerators: Lessons Learned and Carbon Footprint*](https://www.cs.ucla.edu/wp-content/uploads/cs/PATTERSON-10-Lessons-4-TPU-gens-CO2e-45-minutes.pdf) (slide deck, 2022), and the authors' own [*Ten Lessons from Three Generations Shaped Google's TPUv4i*](https://dl.acm.org/doi/abs/10.1109/ISCA52012.2021.00010) (ISCA 2021).

**The lineage extends to the edge.** Google has carried the tensor-accelerator design philosophy down into wearable- and hearable-class hardware. Coral NPU, announced in 2025, is an open-source 32-bit RISC-V design with a vector co-processor implementing the RVV v1.0 vector ISA and a quantized outer-product MAC engine that processes 8-bit operations into 32-bit results. It is targeted at ultra-low-power, always-on edge AI, including smart watches and AR glasses ([Coral NPU datasheet, Google for Developers](https://developers.google.com/coral/guides/hardware/datasheet); [Introducing Coral NPU, Google Developers Blog](https://developers.googleblog.com/en/introducing-coral-npu-a-full-stack-platform-for-edge-ai/); source: [google-coral/coralnpu on GitHub](https://github.com/google-coral/coralnpu)).

<figure style="margin:2rem 0">
<img src="/research/coral-npu-architecture.png" alt="Google's official Coral NPU architecture diagram. A scalar RISC-V core sits at the top, dispatching to a vector execution unit (implementing RVV v1.0) and a matrix execution unit (the outer-product multiply-accumulate engine). All three share access to tightly coupled instruction and data memory and an external memory interface." loading="lazy" style="width:100%;height:auto;border:1px solid var(--border-main);border-radius:8px;background:var(--bg-main)" />
<figcaption style="font-size:0.85em;opacity:0.65;margin-top:0.5rem;text-align:center">Coral NPU architecture: a scalar RISC-V core dispatches to a vector execution unit and a matrix (MAC) execution unit, sharing a tightly coupled memory hierarchy. Source: <a href="https://developers.google.com/coral/guides/architecture">Coral NPU Architecture overview, Google for Developers</a> (Apache 2.0).</figcaption>
</figure>
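The phrase "outer-product MAC engine that processes 8-bit operations into 32-bit results" can be illustrated with a small sketch. This is a generic model of the technique, not Coral NPU's actual datapath or instruction set: the helper names are invented, and the 32-bit wraparound is emulated because Python integers are unbounded.

```python
def clamp_int8(x):
    """Saturate a value to the signed 8-bit range."""
    return max(-128, min(127, x))


def wrap_int32(x):
    """Emulate a signed 32-bit accumulator."""
    return (x + 2**31) % 2**32 - 2**31


def outer_product_matmul(a, b):
    """Compute C = A @ B as a sum of outer products, one per shared index k.

    a is an M x K list of int8 values, b is a K x N list of int8 values.
    Returns an M x N list of int32 accumulators.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]
    for k in range(k_dim):
        col = [clamp_int8(a[i][k]) for i in range(m)]  # column k of A
        row = [clamp_int8(v) for v in b[k]]            # row k of B
        # Every accumulator receives one 8-bit x 8-bit product per step.
        for i in range(m):
            for j in range(n):
                acc[i][j] = wrap_int32(acc[i][j] + col[i] * row[j])
    return acc


a = [[127, -128], [3, 4]]
b = [[127, 1], [-128, 2]]
print(outer_product_matmul(a, b))  # [[32513, -129], [-131, 11]]
```

Even this 2×2 case produces values far outside the 8-bit range, which is why quantized engines pair narrow multipliers with wide accumulators.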

The arc from a 2015 datacenter ASIC to a 2025 open-source RISC-V NPU for hearables is the through-line: matrix-multiply silicon got faster, smaller, more efficient, and ended up everywhere.

## D. What this tells us about the next decade of inference

**Inference is migrating outward from the datacenter.** Google's first-generation TPU went into production inside Google datacenters in 2015 ([Jouppi et al., ISCA 2017, arXiv:1704.04760](https://arxiv.org/abs/1704.04760)). Ten years later, the same company has published a 32-bit RISC-V NPU targeted at smart watches and AR glasses ([Introducing Coral NPU, Google Developers Blog](https://developers.googleblog.com/en/introducing-coral-npu-a-full-stack-platform-for-edge-ai/); source at [google-coral/coralnpu](https://github.com/google-coral/coralnpu)). Apple has shipped an NPU in every iPhone since the A11 Bionic in September 2017 ([CNBC, Sep 12 2017](https://www.cnbc.com/2017/09/12/apple-unveils-a11-bionic-neural-engine-ai-chip-in-iphone-x.html)), and at WWDC 2025 made a roughly 3-billion-parameter on-device model available to any third-party developer through the Foundation Models framework ([Foundation Models, Apple Developer Documentation](https://developer.apple.com/documentation/FoundationModels); [Meet the Foundation Models framework, WWDC 2025, Session 286](https://developer.apple.com/videos/play/wwdc2025/286/)). Qualcomm describes its current mobile NPUs as "designed from the ground up for accelerating AI inference at low power" and ships them across the Snapdragon line ([Hexagon NPU, Qualcomm](https://www.qualcomm.com/products/technology/processors/hexagon); [Snapdragon for AI on-device, Qualcomm](https://www.qualcomm.com/products/technology/artificial-intelligence)). Across three independent vendors, the public record points in the same direction: a growing share of inference work runs on dedicated silicon outside the datacenter.

**Most of the headroom available on chips already in pockets is in software.** Apple's developer-facing on-device ML stack has widened in steady increments since 2017: Core ML in 2017 ([Core ML, Apple Developer Documentation](https://developer.apple.com/documentation/coreml)), the Natural Language framework in 2018 ([WWDC 2018, Session 713](https://nonstrict.eu/wwdcindex/wwdc2018/713/)), MLX in 2023 ([ml-explore/mlx](https://github.com/ml-explore/mlx)), the Foundation Models framework in 2025 ([Deep dive into the Foundation Models framework, WWDC 2025, Session 301](https://developer.apple.com/videos/play/wwdc2025/301/)), and updated MLX guidance for LLM work on Apple silicon at the same WWDC ([Get started with MLX for Apple silicon, WWDC 2025, Session 315](https://developer.apple.com/videos/play/wwdc2025/315/); [Explore large language models on Apple silicon with MLX, WWDC 2025, Session 298](https://developer.apple.com/videos/play/wwdc2025/298/)). Independent research suggests the silicon can do more than the current public path expresses: the maderix proof of concept runs a training backward pass on the ANE, something the public path does not offer, at roughly 5–9% of peak ANE utilization through reverse-engineered private APIs ([Inside the M4 Apple Neural Engine, Part 1, maderix](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine); benchmarks in [Part 2](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615); code at [maderix/ANE on GitHub](https://github.com/maderix/ANE)). The gap between what the chip can do and what the developer-facing path can reach is large enough that closing it is itself a source of performance. The same pattern shows up on the datacenter side: Patterson's retrospective across four TPU generations attributes much of the improvement to compiler and software-stack work rather than process shrinks alone ([*A Decade of Machine Learning Accelerators*, David Patterson, 2022](https://www.cs.ucla.edu/wp-content/uploads/cs/PATTERSON-10-Lessons-4-TPU-gens-CO2e-45-minutes.pdf); [*Ten Lessons from Three Generations Shaped Google's TPUv4i*, Jouppi et al., ISCA 2021](https://dl.acm.org/doi/abs/10.1109/ISCA52012.2021.00010)).

**The fixed-function trade has held up under scrutiny.** Google's published numbers for the first-generation TPU, 15× to 30× faster than its CPU and GPU contemporaries at 30× to 80× better performance-per-watt on Google production inference workloads, were peer-reviewed in 2017 and have been revisited in the literature several times since ([Jouppi et al., ISCA 2017](https://arxiv.org/abs/1704.04760); [retrospective, Jouppi 2023](https://bpb-us-w2.wpmucdn.com/sites.coecis.cornell.edu/dist/7/587/files/2023/06/Jouppi_2017_In_Datacenter.pdf)). Apple's DistilBERT reference implementation reports 3.47 ms end-to-end latency at 0.454 W on an iPhone 13 with the ANE-tuned model ([Deploying Transformers on the Apple Neural Engine, Apple ML Research, 2022](https://machinelearning.apple.com/research/neural-engine-transformers); code at [apple/ml-ane-transformers](https://github.com/apple/ml-ane-transformers); model weights at [apple/ane-distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/apple/ane-distilbert-base-uncased-finetuned-sst-2-english)). The shape of workload these chips are built for, dense low-precision matrix multiplies and convolutions, stayed stable across the five years between those two results (2017 and 2022).

**The application-facing API for on-device models is converging on something close to a library call.** Apple's Foundation Models framework reduces invoking a 3B-parameter model from Swift to roughly three lines of code, with the "hello world" at around the four-minute mark of the WWDC 2025 introduction ([WWDC 2025, Session 286](https://developer.apple.com/videos/play/wwdc2025/286/)). Google's mobile path takes the same shape: Gemini Nano runs through Android's AICore service, and applications reach it through ML Kit's GenAI APIs and the Google AI Edge SDK ([Gemini Nano via AICore, Android Developers](https://developer.android.com/ai/aicore); [ML Kit GenAI APIs, Google for Developers](https://developers.google.com/ml-kit/genai); [Google AI Edge SDK overview](https://ai.google.dev/edge)). The cross-platform on-device path through ONNX Runtime exposes a single inference API that dispatches to whichever NPU or GPU is present on the host ([ONNX Runtime documentation](https://onnxruntime.ai/docs/); [Hugging Face transformers.js for in-browser inference](https://huggingface.co/docs/transformers.js)). The 2017-to-2025 arc across Core ML, MLX, AICore, and ONNX Runtime is the abstraction layer rising in roughly the same direction across vendors.

**Open access to the silicon is widening on both vendor and community sides.** Google has published Coral NPU as an open-source 32-bit RISC-V design with a vector co-processor implementing the [RVV v1.0](https://github.com/riscv/riscv-v-spec) vector ISA and a quantized outer-product MAC engine ([Coral NPU datasheet, Google for Developers](https://developers.google.com/coral/guides/hardware/datasheet); source at [google-coral/coralnpu](https://github.com/google-coral/coralnpu)). On the Apple side, several independent research projects have produced public tooling for studying and exercising the ANE, including [hollance/neural-engine](https://github.com/hollance/neural-engine), [Anemll/Anemll](https://github.com/Anemll/Anemll) with its [companion benchmark suite](https://github.com/Anemll/anemll-bench), [maderix/ANE](https://github.com/maderix/ANE), [mdaiter/ane](https://github.com/mdaiter/ane), and the Asahi Linux project's reverse-engineered Linux driver at [eiln/ane](https://github.com/eiln/ane). Apple's own published interface remains Core ML's coarse-grained scheduler ([MLComputeUnits, Apple Developer Documentation](https://developer.apple.com/documentation/coreml/mlcomputeunits)); the community work above is what has made finer-grained study of the chip possible in public.

## Summary of the hardest numbers

The three figures worth carrying out of this report, all from primary vendor or peer-reviewed sources:

1. **ANE throughput grew roughly 26× in four years.** From 0.6 TFLOPS in the A11 Bionic (2017) to 15.8 TFLOPS in the 16-core ANE of the A15 Bionic (2021).
2. **Apple's DistilBERT reference implementation runs up to 10× faster and uses 14× less peak memory after ANE optimizations**, with 3.47 ms latency at 0.454 W on an iPhone 13 ([Apple ML Research, 2022](https://machinelearning.apple.com/research/neural-engine-transformers)).
3. **The first-generation TPU was 15× to 30× faster, with 30× to 80× better performance-per-watt, than its CPU and GPU contemporaries** on Google's production inference workloads ([Jouppi et al., ISCA 2017](https://arxiv.org/abs/1704.04760)).

## Caveats

- Apple and Google performance numbers in this post are vendor benchmarks against vendor-chosen reference workloads. They are accurate as published, and where they have been independently scrutinized (the TPU paper most of all) they have held up. Treat them as a best case for what the architecture can do under conditions the vendor chose, not as a guarantee for arbitrary models.
- Apple's "hardware activated" phrasing around on-device sentiment analysis is Apple's own. It implies hardware acceleration on supported devices and does not explicitly name the ANE. Reporting it without that nuance would be an upgrade of the claim Apple actually made.
- The argument that the constraint on ANE adoption is software rather than hardware comes from the maderix project. The proof of concept supporting it runs at 5–9% of peak ANE utilization and depends on reverse-engineered private APIs. The hardware-vs-software framing is a reasonable inference from that work; it should be attributed to its source, not asserted as a settled finding.

## Sources

Primary sources, in order of first appearance:

- [Apple unveils A11 Bionic neural engine AI chip in iPhone X — CNBC, Sep 12 2017](https://www.cnbc.com/2017/09/12/apple-unveils-a11-bionic-neural-engine-ai-chip-in-iphone-x.html)
- [Deploying Transformers on the Apple Neural Engine — Apple Machine Learning Research, 2022](https://machinelearning.apple.com/research/neural-engine-transformers)
- [apple/ml-ane-transformers — reference implementation on GitHub](https://github.com/apple/ml-ane-transformers)
- [apple/ane-distilbert-base-uncased-finetuned-sst-2-english — optimized model on Hugging Face](https://huggingface.co/apple/ane-distilbert-base-uncased-finetuned-sst-2-english)
- [Core ML — Apple Developer Documentation](https://developer.apple.com/documentation/coreml)
- [Introducing Natural Language Framework — WWDC 2018, Session 713](https://nonstrict.eu/wwdcindex/wwdc2018/713/)
- [Advances in Natural Language Framework — WWDC 2019, Session 232](https://developer.apple.com/videos/play/wwdc2019/232/)
- [Advances in Natural Language Framework — ASCIIwwdc transcript](https://asciiwwdc.com/2019/sessions/232)
- [ml-explore/mlx — MLX framework on GitHub](https://github.com/ml-explore/mlx)
- [Foundation Models — Apple Developer Documentation](https://developer.apple.com/documentation/FoundationModels)
- [Meet the Foundation Models framework — WWDC 2025, Session 286](https://developer.apple.com/videos/play/wwdc2025/286/)
- [Deep dive into the Foundation Models framework — WWDC 2025, Session 301](https://developer.apple.com/videos/play/wwdc2025/301/)
- [MLComputeUnits — Apple Developer Documentation](https://developer.apple.com/documentation/coreml/mlcomputeunits)
- [MLComputeUnits.cpuAndNeuralEngine — Apple Developer Documentation](https://developer.apple.com/documentation/coreml/mlcomputeunits/cpuandneuralengine)
- [Optimize your Core ML usage — WWDC 2022, Session 10027](https://developer.apple.com/videos/play/wwdc2022/10027/)
- [Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering — maderix](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine)
- [maderix/ANE — training on the ANE via private APIs, on GitHub](https://github.com/maderix/ANE)
- [Anemll/Anemll — ANEMLL, open-source LLM inference on the Apple Neural Engine](https://github.com/Anemll/Anemll)
- [Anemll/anemll-bench — companion benchmarking suite for the ANE](https://github.com/Anemll/anemll-bench)
- [hollance/neural-engine — Matthijs Hollemans, "Everything we actually know about the Apple Neural Engine"](https://github.com/hollance/neural-engine)
- [mdaiter/ane — early ANE reverse engineering with Python and Objective-C samples](https://github.com/mdaiter/ane)
- [eiln/ane — reverse-engineered Linux driver for the ANE, Asahi Linux project](https://github.com/eiln/ane)
- [In-Datacenter Performance Analysis of a Tensor Processing Unit — Jouppi et al., ISCA 2017, arXiv:1704.04760](https://arxiv.org/abs/1704.04760)
- [In-Datacenter Performance Analysis of a Tensor Processing Unit — ACM Digital Library](https://dl.acm.org/doi/10.1145/3079856.3080246)
- [Coral NPU datasheet — Google for Developers](https://developers.google.com/coral/guides/hardware/datasheet)
- [Introducing Coral NPU: A full-stack platform for Edge AI — Google Developers Blog](https://developers.googleblog.com/en/introducing-coral-npu-a-full-stack-platform-for-edge-ai/)
- [google-coral/coralnpu — Coral NPU source on GitHub](https://github.com/google-coral/coralnpu)
- [Hexagon NPU — Qualcomm](https://www.qualcomm.com/products/technology/processors/hexagon)
- [Snapdragon for on-device AI — Qualcomm](https://www.qualcomm.com/products/technology/artificial-intelligence)
- [Gemini Nano via Android AICore — Google for Developers](https://developer.android.com/ai/aicore)
- [ML Kit GenAI APIs — Google for Developers](https://developers.google.com/ml-kit/genai)
- [Google AI Edge SDK overview — Google AI for Developers](https://ai.google.dev/edge)
- [ONNX Runtime documentation](https://onnxruntime.ai/docs/)
- [Hugging Face transformers.js — in-browser inference documentation](https://huggingface.co/docs/transformers.js)
- [RISC-V Vector Extension (RVV) specification — riscv/riscv-v-spec on GitHub](https://github.com/riscv/riscv-v-spec)

Videos and supplementary materials:

- [Optimize your Core ML usage — WWDC 2022, Session 10027, on Apple Developer](https://developer.apple.com/videos/play/wwdc2022/10027/) · [YouTube mirror](https://www.youtube.com/watch?v=THXq071qZ6E)
- [Meet the Foundation Models framework — WWDC 2025, Session 286, on Apple Developer](https://developer.apple.com/videos/play/wwdc2025/286/) · [YouTube mirror](https://www.youtube.com/watch?v=mJMvFyBvZEk)
- [Explore machine learning on Apple platforms — WWDC 2024, on Apple Developer](https://developer.apple.com/videos/play/wwdc2024/10223/) · [YouTube mirror](https://www.youtube.com/watch?v=p_hyo2FRil4)
- [Get started with MLX for Apple silicon — WWDC 2025, Session 315](https://developer.apple.com/videos/play/wwdc2025/315/)
- [Explore large language models on Apple silicon with MLX — WWDC 2025, Session 298](https://developer.apple.com/videos/play/wwdc2025/298/)
- [A Decade of Machine Learning Accelerators — David Patterson, slide deck, 2022](https://www.cs.ucla.edu/wp-content/uploads/cs/PATTERSON-10-Lessons-4-TPU-gens-CO2e-45-minutes.pdf)
- [Ten Lessons from Three Generations Shaped Google's TPUv4i — Jouppi et al., ISCA 2021](https://dl.acm.org/doi/abs/10.1109/ISCA52012.2021.00010)
- [Retrospective on the original TPU paper — Jouppi, 2023 PDF](https://bpb-us-w2.wpmucdn.com/sites.coecis.cornell.edu/dist/7/587/files/2023/06/Jouppi_2017_In_Datacenter.pdf)
- [What the Hell is a Neural Engine? — Greg Gant, 2024](https://blog.greggant.com/posts/2024/06/24/what-the-hell-is-an-apple-neural-engine.html)
- [Inside the M4 Apple Neural Engine, Part 2: Benchmarks — maderix](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615)

