256K Tokens on One GPU? Jamba’s Engineering Magic Explained

by Language Models (dot tech), April 10th, 2025

Too Long; Didn't Read

Jamba is a hybrid large language model architecture that combines Transformer, Mamba (state-space), and Mixture-of-Experts (MoE) layers. Designed for high efficiency and long-context processing (up to 256K tokens), it delivers strong benchmark performance with only 12B active parameters and runs on a single 80GB GPU—offering 3x the throughput of similar-sized models.

Authors:

(1) Opher Lieber, with Equal contribution; (2) Barak Lenz, with Equal contribution; (3) Hofit Bata; (4) Gal Cohen; (5) Jhonathan Osin; (6) Itay Dalmedigos; (7) Erez Safahi; (8) Shaked Meirom; (9) Yonatan Belinkov; (10) Shai Shalev-Shwartz; (11) Omri Abend; (12) Raz Alon; (13) Tomer Asida; (14) Amir Bergman; (15) Roman Glozman; (16) Michael Gokhman; (17) Avashalom Manevich; (18) Nir Ratner; (19) Noam Rozen; (20) Erez Shwartz; (21) Mor Zusman; (22) Yoav Shoham.


5. Evaluation

In general we approach benchmarks cautiously, as they correlate only partially with what matters in real applications, and furthermore invite gaming the system in order to boast vanity numbers. Nevertheless, we present several indicative results.

5.1 Academic Benchmarks

We report results with a wide range of standard academic benchmarks:


Common sense reasoning: HellaSwag (10-shot) [47], WinoGrande (5-shot) [37], ARC-E (zero-shot) and ARC-Challenge (25-shot) [9], and PIQA (zero-shot) [3].


Reading Comprehension: BoolQ (10-shot) [8] and QuAC (zero-shot) [5].


Others: GSM8K (3-shot CoT) [10], HumanEval (pass@1; see the sketch after this list) [4], Natural Questions closed-book (NQ; 5-shot) [26], and TruthfulQA (zero-shot) [27].


Aggregate benchmarks: MMLU (5-shot) [20] and BBH (3-shot) [43].
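
The shot counts above indicate how many worked examples appear in each prompt, while pass@1 for HumanEval is the fraction of problems solved by a generated program that passes the unit tests. As a point of reference, here is a minimal sketch of the standard unbiased pass@k estimator (following the HumanEval paper); the sample counts in the example are hypothetical and not taken from this evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated per problem
    c: number of samples that passed the unit tests
    k: budget of samples allowed per problem
    """
    if n - c < k:
        return 1.0  # fewer than k failures, so some passing sample is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 20 samples per problem, with 5, 0, and 20 passing.
problems = [(20, 5), (20, 0), (20, 20)]
score = sum(pass_at_k(n, c, k=1) for n, c in problems) / len(problems)
print(f"pass@1 = {score:.3f}")  # average of the per-problem estimates
```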


Table 2: Comparison of Jamba with other publicly available models. Jamba obtains similar or better performance with much better throughput.


Table 2 compares Jamba to several publicly available models on common academic benchmarks for evaluating language models. We compare with Llama-2 13B [45], which has about the same number of active parameters as our model; Llama-2 70B, which is larger than our model; Gemma [44], which has 7B parameters; and Mixtral [23], which has about the same number of active and total parameters as our model.


Noticeably, Jamba performs comparably to the leading publicly available models of similar or larger size, including Llama-2 70B and Mixtral. At the same time, our model has a smaller number of total available parameters than Llama-2 (52B compared to 70B). Moreover, as a sparse model, Jamba has only 12B active parameters, similar to Mixtral’s 12.9B active parameters. However, as a fully-attentional model, Mixtral has a large memory footprint with long sequences, requiring 32GB for the KV cache with 256K tokens. In contrast, thanks to its hybrid Attention-Mamba architecture, Jamba’s KV cache takes only 4GB even at such a long context (Section 2). Importantly, Jamba achieves this strong performance while having much better throughput than Llama-2 70B and Mixtral, up to a 3x improvement (Section 3.2).
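
The 32GB-versus-4GB gap follows directly from how few of Jamba’s layers keep a key-value cache: Mamba layers carry a constant-size state instead of a cache that grows with the sequence. The back-of-the-envelope sketch below uses assumed hyperparameters (8 KV heads with grouped-query attention, head dimension 128, 16-bit cache entries), chosen for illustration rather than taken from either model card; only the attention-layer count differs between the two configurations.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Size of a standard KV cache: keys and values for every attention layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

CTX = 256 * 1024  # 256K tokens

# Assumed hyperparameters, for illustration only.
mixtral_like = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=CTX)
jamba_like   = kv_cache_bytes(n_attn_layers=4,  n_kv_heads=8, head_dim=128, seq_len=CTX)

print(f"Fully-attentional (32 attention layers): {mixtral_like / 2**30:.1f} GiB")  # ~32 GiB
print(f"Hybrid (4 attention layers):             {jamba_like / 2**30:.1f} GiB")    # ~4 GiB
```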


In summary, Jamba demonstrates the ability of hybrid architectures to reach the performance of state-of-the-art Transformer-based models of the same size class, while retaining the benefits of an SSM.

5.2 Long-Context Evaluations

We have successfully trained Jamba models with context lengths of up to 1M tokens. The released model handles context lengths of up to 256K tokens. In this section, we evaluate it on synthetic and naturalistic benchmarks that test its long-context capabilities.


5.2.1 Needle-in-a-haystack


As Figure 4 shows, Jamba has excellent performance in the needle-in-a-haystack evaluation, which requires retrieving a simple statement planted in a long context window [24]. This result is noteworthy especially given that our implementation of Jamba uses only 4 attention layers.
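
For readers unfamiliar with the setup, a needle-in-a-haystack prompt is a long stretch of filler text with a single factual statement (the "needle") inserted at a controlled depth, followed by a question that can only be answered by retrieving that statement. A minimal sketch of how such a prompt is built (the filler, needle, and question here are made up for illustration and are not the ones used in [24]):

```python
def build_haystack_prompt(filler_paragraph: str, needle: str, question: str,
                          total_paragraphs: int, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of a long
    filler context, then append a question answerable only from the needle."""
    position = int(depth * total_paragraphs)
    paragraphs = [filler_paragraph] * total_paragraphs
    paragraphs.insert(position, needle)
    context = "\n\n".join(paragraphs)
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical example; real evaluations sweep depth and total context length,
# then score whether the model's answer contains the planted fact.
prompt = build_haystack_prompt(
    filler_paragraph="The quick brown fox jumps over the lazy dog. " * 20,
    needle="The secret passcode for the vault is 7421.",
    question="What is the secret passcode for the vault?",
    total_paragraphs=500,
    depth=0.5,
)
```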


5.2.2 Naturalistic long-context evaluation


We evaluate Jamba’s ability to handle long contexts using question-answering benchmarks with long inputs. To this end, we repurpose five of the longest-context datasets from L-Eval [2] by structuring them in a few-shot format (we use 3 shots in all experiments here). Specifically, we evaluated the models on the following datasets: NarrativeQA (QA on narratives; [25]), LongFQA (finance; [2]), Natural Questions (NQ; Wikipedia; [26]), CUAD (law; [21]), and SFiction (science fiction). The average input length in these datasets ranges from 6K to 62K tokens, and the few-shot format expands these context lengths considerably further.
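
Concretely, the few-shot structuring means each prompt concatenates a few complete (document, question, answer) exemplars before the test document, so with 3 shots the effective context grows to roughly four times the raw input length. A rough sketch of such a prompt builder (the field labels and separator are assumptions, not the paper’s exact template):

```python
def few_shot_qa_prompt(exemplars, test_doc, test_question, n_shots=3):
    """Concatenate n_shots full (document, question, answer) exemplars,
    then the test document and question with the answer left blank."""
    blocks = []
    for doc, question, answer in exemplars[:n_shots]:
        blocks.append(f"Document:\n{doc}\n\nQuestion: {question}\nAnswer: {answer}")
    blocks.append(f"Document:\n{test_doc}\n\nQuestion: {test_question}\nAnswer:")
    return "\n\n---\n\n".join(blocks)
```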


Figure 4: A needle-in-a-haystack evaluation showing Jamba’s ability to recall statements placed in the middle of contexts of up to 256K tokens in length.


Table 3 summarizes the evaluation results, in terms of F1. Jamba outperforms Mixtral on most of the datasets as well as on average. In addition, as these long-context tasks require substantial computation, Jamba’s efficiency shines here, with much better throughput at long contexts (Section 3.2).


Table 3: Results (F1) on long-context QA benchmarks, with a 3-shot format.
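
The F1 reported in Table 3 is the standard token-overlap score used for extractive QA: precision and recall are computed over the multiset of tokens shared by the prediction and the gold answer. A minimal sketch (whitespace tokenization only; production harnesses typically also strip punctuation and articles):

```python
from collections import Counter

def qa_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("the treaty was signed in 1848", "1848"))  # ~0.286
```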


This paper is available on arXiv under the CC BY-SA 4.0 DEED license.

