AI Science
Edition 3, March 22, 2026, 12:24 PM
In This Edition
The Flash-MoE story continues to evolve as the community grapples with practical implications: new commentary highlights the tension between impressive demos on high-end Macs and actual accessibility for most users, plus an intriguing debate about whether a "batch rendering" paradigm could make slow local inference viable for complex queries.
A new section covers a quietly significant arXiv preprint reporting what may be the first non-trivial error in a published physics paper caught through formal verification — researchers using the Lean theorem prover invalidated a 20-year-old theorem on Higgs doublet model stability, raising pointed questions about how much of the physics literature would survive rigorous machine-checked proof.
A Visual Atlas of Attention Variants in Modern LLMs
Sebastian Raschka has published a comprehensive visual guide to attention variants used in current open-weight LLMs, alongside a new LLM architecture gallery with 45+ entries (HN discussion). The guide traces the evolution from standard multi-head attention (MHA) through the efficiency frontier, and serves as a useful snapshot of where the field stands architecturally in early 2026.
The taxonomy is instructive. Grouped-query attention (GQA) remains the workhorse — used in Llama 3, Qwen3, Gemma 3, and many MoE models — reducing KV-cache cost by sharing key-value projections across multiple query heads. It's a spectrum: fewer groups means cheaper inference but can degrade modeling quality. Multi-head latent attention (MLA), introduced in DeepSeek-V2, takes a different approach: instead of reducing the number of KV heads, it compresses what gets cached via a learned latent representation. DeepSeek's ablations showed MLA preserving or even exceeding MHA quality at the same memory budget — a stronger claim than "it's just cheaper." MLA now appears in DeepSeek V3, Kimi K2, GLM-5, and Mistral Large 3, though Raschka notes it reportedly works best above ~100B parameters.
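The KV-sharing mechanic behind GQA is easy to see in a toy sketch. The following is illustrative NumPy only, with made-up head counts and no relation to any production kernel: eight query heads share two cached key/value heads by repeating each KV head across its query group.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Toy GQA: many query heads share a smaller set of cached KV heads."""
    n_q_heads, seq_len, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads
    # Broadcast each KV head to all query heads in its group.
    k = np.repeat(k, group, axis=0)        # (n_q_heads, seq_len, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))        # 8 query heads
k = rng.standard_normal((2, 4, 16))        # only 2 KV heads in the cache
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v)     # shape (8, 4, 16)
```

Shrinking the number of cached KV heads from 8 to 2 cuts the KV-cache footprint 4x while the query heads are untouched — exactly the spectrum of trade-offs described above.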
Sliding window attention (SWA) limits each token to a fixed local context window, with periodic global attention layers for full-sequence information flow. Gemma 3 pushed from a 1:1 to a 5:1 local-to-global ratio with a 1024-token window, with ablations showing minimal perplexity degradation. DeepSeek Sparse Attention, from V3.2, goes further by learning which past tokens to attend to via a lightning indexer and token selector, rather than hard-coding locality.
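The local-window constraint in SWA reduces to an attention mask. A minimal sketch with toy sizes (real models such as Gemma 3 use a 1024-token window and interleave full-attention layers at the ratio described above):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where token i may attend to token j: causal, within `window`."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# With window=3, token 5 attends only to positions 3, 4, and 5.
mask = sliding_window_mask(8, window=3)
```

Each row has at most `window` True entries, so attention cost per token is constant in sequence length; the periodic global layers restore full-sequence information flow.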
The most striking trend is the rise of hybrid architectures that replace most attention layers with cheaper linear or state-space modules. Qwen3-Next pioneered a 3:1 mix of Gated DeltaNet (a linear-attention variant related to Mamba-2) and gated full-attention blocks. Qwen3.5 promoted this from experimental side-branch to main flagship — a strong signal that hybrid attention is production-ready. Kimi Linear swaps in channel-wise gating and gated MLA. Ling 2.5 uses Lightning Attention with MLA. NVIDIA's Nemotron goes further with Mamba-2 as the primary sequence module. Raschka observes that while hybrids offer superior long-context efficiency, their inference stacks aren't yet as optimized as classic GQA setups for local deployment, and the field is still waiting on DeepSeek V4 to set the next trend.
2025 Turing Award Recognizes Quantum Information Pioneers
The ACM has named Charles H. Bennett and Gilles Brassard as co-recipients of the 2025 A.M. Turing Award for their foundational contributions to quantum information science — the first time the prize has recognized quantum computing research (HN discussion). Their most celebrated work is the BB84 quantum key distribution protocol (1984), which guarantees security through the laws of physics rather than mathematical complexity: any eavesdropper measuring quantum-encoded photons necessarily disturbs them, creating a detectable trace.
Bennett's earlier contributions are equally foundational. His 1973 proof that computation can be carried out reversibly — run forward and backward with no net energy cost — established a deep connection between physics and information theory, building on Rolf Landauer's 1961 argument that information is fundamentally physical. In 1993, Bennett co-authored the quantum teleportation protocol, demonstrating that quantum states can be transferred between locations using entanglement. The practical urgency of their cryptographic work has only grown since Peter Shor's 1994 proof that quantum computers could break most classical encryption — making BB84-style quantum key distribution look less like a theoretical curiosity and more like critical infrastructure for the post-quantum era.
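The eavesdropper-detection property at the heart of BB84 can be demonstrated in a few lines of simulation. This is a toy intercept-resend model, not a faithful quantum treatment: measuring in the wrong basis randomizes the bit, so Eve's tampering surfaces as roughly 25% errors in the sifted key.

```python
import random

def bb84_error_rate(n_photons, eavesdrop, seed=0):
    """Simulate BB84 sifting: compare Alice's and Bob's bits on the
    rounds where their randomly chosen bases happened to match."""
    rng = random.Random(seed)
    errors = kept = 0
    for _ in range(n_photons):
        bit = rng.randrange(2)
        a_basis = rng.randrange(2)
        state_bit, state_basis = bit, a_basis
        if eavesdrop:
            e_basis = rng.randrange(2)
            # Measuring in the wrong basis collapses to a random bit...
            if e_basis != state_basis:
                state_bit = rng.randrange(2)
            state_basis = e_basis  # ...and the photon is re-sent in Eve's basis
        b_basis = rng.randrange(2)
        measured = state_bit if b_basis == state_basis else rng.randrange(2)
        if b_basis == a_basis:     # sifting: keep matching-basis rounds only
            kept += 1
            errors += measured != bit
    return errors / kept

clean = bb84_error_rate(20000, eavesdrop=False)   # 0.0
tapped = bb84_error_rate(20000, eavesdrop=True)   # ~0.25
```

Alice and Bob detect the intruder by publicly comparing a sample of the sifted key: any error rate near 25% means the channel was measured in transit.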
"System 3": How AI May Be Reshaping Human Reasoning
A widely discussed SSRN preprint proposes that AI tools are creating a "System 3" in human cognition, extending Kahneman's dual-process framework of fast intuitive thinking (System 1) and slow deliberate reasoning (System 2) with a third mode: offloading cognitive work to AI (HN discussion, 103 comments). The paper finds that participants with higher trust in AI and lower need for cognition showed greater "surrender" to this System 3, and that time pressure and incentives shifted baseline performance but did not eliminate the pattern.
The HN discussion (181 points) was substantive and divided. Several commenters reported personal experience with cognitive offloading: one noted they used to manually sanity-check financial data from SEC filings, but stopped once they began relying on AI and now catch fewer errors. Others argued AI had improved their reasoning by helping them discover solutions to longstanding problems. Skeptics raised concerns about the underlying framework itself, noting that Kahneman's System 1/System 2 model has faced replication challenges. Multiple readers also flagged that parts of the paper appeared to be AI-generated, adding an ironic layer to research about AI's effects on human thinking. The paper remains a preprint and does not appear to have undergone peer review.
ArXiv Declares Independence from Cornell
In a significant governance shift for scientific publishing infrastructure, arXiv has declared independence from Cornell University, its host since 2001 (the preprint server itself was founded at Los Alamos National Laboratory in 1991) (HN discussion, 272 comments). The story gathered 799 points on HN, reflecting the research community's deep investment in the platform that underpins open-access dissemination across physics, mathematics, computer science, and — critically for AI research — machine learning.
While the full Science article is paywalled, the move represents a maturation of arXiv's institutional status. The preprint server has become the de facto venue for AI and ML research publication, with most significant papers appearing there weeks or months before formal peer review. Any governance change to arXiv has downstream implications for how AI research is disseminated, discovered, and cited.
Flash-MoE: SSD-Streaming Inference Sparks a Community Ecosystem
Flash-MoE, the pure C/Metal inference engine that streams the 397-billion parameter Qwen3.5-397B-A17B Mixture-of-Experts model from SSD on a MacBook Pro with 48GB RAM, continues to hold the front page — now at 234 points and 86 comments (discussion). The project achieves 4.4+ tokens/second at 4-bit quantization by streaming only the active experts (~6.75MB per layer) via parallel pread() calls, relying on the OS page cache (~35GB, ~71% hit rate) rather than custom caching.
The key engineering insights remain the project's most compelling contribution: 58+ experiments documented with unusual candor. Speculative early routing hurt by 38% due to cache pollution. Prefetching via F_RDADVISE was net zero because SSD DMA competes with GPU bandwidth on Apple Silicon's unified memory controller. An MLP-based expert routing predictor achieved only 31% accuracy. The winning optimization was an FMA-rearranged dequantization kernel — fma(nibble, scale*x, bias*x) instead of (nibble * scale + bias) * x — yielding a 12% speedup by letting the GPU's fused multiply-add unit handle dequant and multiply in a single instruction.
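The rearrangement is plain distributivity: (nibble * scale + bias) * x equals nibble * (scale * x) + bias * x, which maps onto one fused multiply-add once scale*x and bias*x are hoisted. A Python sketch of the algebra (the real kernel is Metal; the constants here are arbitrary):

```python
def dequant_then_mul(nibble, scale, bias, x):
    # Naive order: dequantize the 4-bit value, then multiply by the activation.
    return (nibble * scale + bias) * x

def fma_rearranged(nibble, scale, bias, x):
    # Distributed form: one multiply-add with precomputed scale*x and bias*x,
    # i.e. fma(nibble, scale*x, bias*x) on hardware.
    return nibble * (scale * x) + bias * x

# All 16 possible nibble values give identical results either way.
vals = [dequant_then_mul(n, 0.125, -1.0, 0.5) for n in range(16)]
fvals = [fma_rearranged(n, 0.125, -1.0, 0.5) for n in range(16)]
```

The payoff is not fewer flops but a shape the GPU's FMA unit executes in one instruction per element instead of a multiply-add followed by a multiply.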
The discussion has now become a substantive technical forum. mkw created mlx-flash, a fork that extends Flash-MoE's streaming approach with 4-bit quantization, hybrid disk+RAM streaming with a tunable "control knob," and support for Mamba2 architectures — targeting intelligence-dense models like Nemotron 3 Nano 30B on 16GB machines. Meanwhile, tarruda — better known as the creator of Neovim — shared detailed llama-bench numbers for Qwen 3.5 397B at 2.46 BPW on an M1 Ultra with 128GB: 20 tok/s generation, 190 tok/s prompt processing at empty context, degrading gracefully to 8 tok/s and 41 tok/s at 250K tokens. The benchmark scores are respectable: 87.86% MMLU, 82.32% GPQA Diamond, versus the original BF16 model's 88% GPQA. The GPU power draw? A remarkably modest ~54 watts.
A lively quantization quality debate emerged. Aurornis noted that 2-bit quants "look promising in short sessions but then you try to do real work and realize they're a waste of time," citing Flash-MoE's own finding that 2-bit produced \name\ instead of "name" in JSON output. justacatbot argued that "a well-tuned 30B at 4-bit usually outperforms a 70B+ at 2-bit" for actual work. But tarruda countered that not all quants at a given bits-per-weight are equal — the smol-IQ2_XS quant uses a dynamic mix of q8_0, q6_k, q4_k, and iq2_xs across different tensors, and maintained coherence through 70K context.
As the story matures, community pushback on framing has sharpened. jllyhill voiced a sentiment likely shared by many: "I'm getting tired of 'laptop' in every one of these clickbait titles turning out to be $3000 Macbook… I really don't like that the title implies local LLM becomes viable for an average person with the actual hardware being out of reach for 99%." It's a fair point — the hardware floor for meaningfully running 397B parameters remains a $3,000+ Mac with 48GB+ unified memory. An interesting counterpoint came from qiine, who drew an analogy to offline rendering: "To render movies we happily wait for the computer to calculate how lights bounce around, for hours even days. So why not do the same with AIs?" — suggesting a batch-oriented paradigm where you submit complex queries to large models and retrieve answers later. Aurornis pushed back, noting that interactive LLM workflows involve hundreds or thousands of turns per day, making slow inference impractical. The broader signal: SSD-streaming MoE inference has gone from proof-of-concept to an active ecosystem in under 24 hours, but the gap between demo and daily-driver remains real.
Lean Theorem Prover Catches First Non-Trivial Error in Published Physics Paper
A quietly significant preprint on arXiv reports what may be the first non-trivial error in a published physics paper discovered through formal verification (discussion). Researchers used the Lean interactive theorem prover, together with the Mathlib and PhysLib (formerly PhysLean) libraries, to formalize a widely cited 2006 paper by Maniatis, von Manteuffel, Nachtmann, and Nagel on the stability of the two Higgs doublet model (2HDM) potential — a cornerstone of beyond-Standard-Model particle physics. The formalization revealed an error that invalidates the paper's main theorem.
The finding is noteworthy on several levels. The 2HDM potential stability theorem has been cited for twenty years and underpins subsequent theoretical work. That it survived peer review and two decades of use before being caught by a machine proof underscores a systemic issue: the mathematical arguments in physics papers are often trusted on the basis of reputation and intuitive plausibility rather than rigorous formal verification. The authors pose an "uncomfortable question" — how many other physics papers would fail under this higher level of scrutiny?
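For readers unfamiliar with the workflow, a Lean formalization encodes each claim as a theorem whose proof the compiler checks mechanically. A deliberately trivial example in the style of a boundedness condition (illustrative only, using Mathlib lemmas; this is not the actual PhysLean development):

```lean
import Mathlib

-- A toy "stability"-flavored claim: a sum of squares is bounded below by zero.
-- Lean refuses to compile if any step of the proof is wrong, which is exactly
-- how an invalid argument in a formalized physics paper gets surfaced.
theorem toy_stability (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 :=
  add_nonneg (sq_nonneg a) (sq_nonneg b)
```

The real formalization replaces this one-liner with the 2HDM potential and its stability conditions; the principle is the same — no step is accepted on reputation or plausibility.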
This sits at the intersection of two active research threads: the growing maturity of interactive theorem provers as practical tools (Lean's ecosystem, especially Mathlib, has expanded dramatically in recent years), and the push for greater reproducibility in scientific research. While formal verification of physics has been largely an academic curiosity, this result provides a concrete demonstration of its value — not as a rubber stamp on correct work, but as a tool for catching subtle errors that human reviewers miss. The story has modest traction on HN so far (8 points), but its implications for the relationship between AI-adjacent formal methods and scientific rigor could prove more lasting than the headline suggests.