Inside Jamba’s Architecture: Mamba Layers, MoE, and the Future of AI Models

by Language Models (dot tech), April 10th, 2025

Too Long; Didn't Read

Jamba is a hybrid large language model architecture that combines Transformer, Mamba (state-space), and Mixture-of-Experts (MoE) layers. Designed for high efficiency and long-context processing (up to 256K tokens), it delivers strong benchmark performance with only 12B active parameters and runs on a single 80GB GPU—offering 3x the throughput of similar-sized models.


Authors:

(1) Opher Lieber (equal contribution); (2) Barak Lenz (equal contribution); (3) Hofit Bata; (4) Gal Cohen; (5) Jhonathan Osin; (6) Itay Dalmedigos; (7) Erez Safahi; (8) Shaked Meirom; (9) Yonatan Belinkov; (10) Shai Shalev-Shwartz; (11) Omri Abend; (12) Raz Alon; (13) Tomer Asida; (14) Amir Bergman; (15) Roman Glozman; (16) Michael Gokhman; (17) Avashalom Manevich; (18) Nir Ratner; (19) Noam Rozen; (20) Erez Shwartz; (21) Mor Zusman; (22) Yoav Shoham.


6. Ablations and Insights

This section discusses ablation experiments we ran for different design choices in our implementation of the Jamba architecture. First we show the benefit of combining attention and Mamba layers, the ratio at which they should be combined, and how to interleave them. We investigate cases where pure Mamba fails, suggesting that it struggles to develop in-context learning capabilities, while the Attention–Mamba hybrid exhibits in-context learning similar to vanilla Transformers. Then we show the benefit of adding MoE on top of a hybrid Attention–Mamba model. Finally, we share two additional learnings that we found useful: explicit positional information is not needed in Jamba, and Mamba layers necessitate special normalization to stabilize training at large scale.[4]
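The last point is only summarized here; this excerpt does not spell out which normalization is used or where it sits inside the Mamba layer (that is covered later in the paper). As a rough, non-authoritative illustration, the snippet below sketches RMSNorm, a common choice for stabilizing activations at scale; the module and its placement are assumptions for illustration, not the paper's stated recipe.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias.

    Sketch only -- where such a norm would be applied inside a Mamba layer
    is not specified in this excerpt.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then rescale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```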


For these ablations, we report the following measures, which exhibit informative performance even at small data or model scale.


• Academic benchmarks: HellaSwag (10-shot) [47], WinoGrande (5-shot) [37], Natural Questions closed-book (NQ; 5-shot) [26].


• HuggingFace OpenLLM leaderboard (OLLM) [11]: a summary statistic of several datasets. We report results with our reproduction.


• Perplexity evaluations: we report log-prob (per byte) on texts from three domains: C4, Books, and code.
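Reporting log-probability per byte, rather than per token, makes the numbers comparable across models with different tokenizers, since the byte count of a text does not depend on how it is tokenized. Below is a minimal sketch of this computation, assuming a Hugging Face-style causal language model and tokenizer as placeholders; it is not the paper's evaluation code.

```python
import torch


def log_prob_per_byte(model, tokenizer, text: str) -> float:
    """Sum token log-likelihoods of `text` and divide by its UTF-8 byte length."""
    ids = tokenizer(text, return_tensors="pt").input_ids        # (1, seq_len)
    with torch.no_grad():
        logits = model(ids).logits                              # (1, seq_len, vocab)
    # Log-probability of each token conditioned on the preceding tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item() / len(text.encode("utf-8"))
```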

6.1 Benefits of combining Attention and Mamba

We first investigate the ratio of Attention to Mamba layers (a : m), with 1.3B-parameter models trained for 250B tokens. As Table 4 shows, the hybrid Jamba model outperforms the pure Attention and pure Mamba models. The ratio of Attention-to-Mamba layers may be 1:3 or 1:7 with virtually no performance difference. Figure 5 shows the training loss of these models, where Jamba exhibits improved loss during training. Given that a 1:7 ratio is more compute-efficient and shows similar performance, we opt for it in our larger-scale experiments.
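As a rough illustration of what an a : m interleaving could look like, the sketch below lays out a repeating block in which each attention layer is followed by m Mamba layers. The layer names, the offset within each period, and the construction are illustrative assumptions; the paper's actual block layout, including where MLP and MoE layers sit, is defined in its architecture section and not reproduced here.

```python
def hybrid_layer_pattern(num_layers: int, a: int = 1, m: int = 7) -> list:
    """Lay out a repeating a:m pattern of attention and Mamba layers.

    With a=1, m=7, every period of 8 layers contains 1 attention layer and
    7 Mamba layers; over 32 layers that gives 4 attention and 28 Mamba layers.
    """
    period = a + m
    # Place the attention layer(s) at the start of each period; the offset
    # within the period is an arbitrary choice for this sketch.
    return ["attention" if i % period < a else "mamba" for i in range(num_layers)]


pattern = hybrid_layer_pattern(32, a=1, m=7)
print(pattern.count("attention"), pattern.count("mamba"))  # 4 28
```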


Table 4: Results on academic benchmarks and log-probability evaluations showing improved performance of the Attention-Mamba hybrid (no MoE) compared to vanilla Attention and Mamba models. There is no substantial difference between 1:3 and 1:7 ratios of Attention-to-Mamba layers. Models are 1.3B parameters, trained for 250B tokens.


Figure 5: Training loss curves for pure Attention, pure Mamba, and Attention-Mamba hybrids (no MoE), with ratios a : m of 1:3 and 1:7. All models are 1.3B parameters. The two hybrids achieve better loss throughout this training run, without any noticeable difference between the different Attention/Mamba ratios.


Next, we compare the performance of vanilla Transformer, vanilla Mamba, and Attention-Mamba hybrid models, at 7B model size, after training on 50B tokens. As Table 5 shows, the pure Mamba model is quite competitive, but lags slightly behind pure Attention. The hybrid Attention-Mamba (without MoE) outperforms the pure models while obtaining better throughput than the vanilla Transformer (Section 3.2).


Table 5: Results on academic benchmarks and log-prob evaluations, comparing pure Attention, pure Mamba, and Attention-Mamba hybrid (no MoE). Models are 7B parameters, trained for 50B tokens.


Figure 6 shows the training loss of the three architectures. While the pure Transformer and Mamba models have a similar convergence, the hybrid Jamba (no MoE) has a lower loss throughout this run.


Figure 6: Training loss curves for pure Attention, pure Mamba, and an Attention-Mamba hybrid (no MoE). All models are 7B parameters. The hybrid achieves better loss throughout this training run.


This paper is available on arXiv under a CC BY-SA 4.0 license.


[4] In all the ablation experiments, “pure Mamba” refers to models with Mamba layers interleaved with MLP layers.
