256K Tokens on One GPU? Jamba’s Engineering Magic Explained

by Language Models (dot tech), April 10th, 2025

Too Long; Didn't Read

Jamba is a hybrid large language model architecture that combines Transformer, Mamba (state-space), and Mixture-of-Experts (MoE) layers. Designed for high efficiency and long-context processing (up to 256K tokens), it delivers strong benchmark performance with only 12B active parameters and runs on a single 80GB GPU—offering 3x the throughput of similar-sized models.

Authors:

(1) Opher Lieber, with Equal contribution; (2) Barak Lenz, with Equal contribution; (3) Hofit Bata; (4) Gal Cohen; (5) Jhonathan Osin; (6) Itay Dalmedigos; (7) Erez Safahi; (8) Shaked Meirom; (9) Yonatan Belinkov; (10) Shai Shalev-Shwartz; (11) Omri Abend; (12) Raz Alon; (13) Tomer Asida; (14) Amir Bergman; (15) Roman Glozman; (16) Michael Gokhman; (17) Avashalom Manevich; (18) Nir Ratner; (19) Noam Rozen; (20) Erez Shwartz; (21) Mor Zusman; (22) Yoav Shoham.


5. Evaluation

In general we approach benchmarks cautiously, as they correlate only partially with what matters in real applications, and furthermore invite gaming the system in order to boast vanity numbers. Nevertheless, we present several indicative results.

5.1 Academic Benchmarks

We report results with a wide range of standard academic benchmarks:


Common sense reasoning: HellaSwag (10-shot) [47], WinoGrande (5-shot) [37], ARC-E (zero-shot) and ARC-Challenge (25-shot) [9], and PIQA (zero-shot) [3].


Reading Comprehension: BoolQ (10-shot) [8] and QuAC (zero-shot) [5].


Others: GSM8K (3-shot CoT) [10], HumanEval (pass@1; see the sketch after this list) [4], Natural Questions closed-book (NQ; 5-shot) [26], and TruthfulQA (zero-shot) [27].


Aggregate benchmarks: MMLU (5-shot) [20] and BBH (3-shot) [43].
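
The shot counts above indicate how many worked examples appear in each prompt, while pass@1 for HumanEval is the fraction of problems solved by a generated program that passes the unit tests. As a point of reference, here is a minimal sketch of the standard unbiased pass@k estimator (following the HumanEval paper); the sample counts in the example are hypothetical and not taken from this evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated per problem
    c: number of samples that passed the unit tests
    k: budget of samples allowed per problem
    """
    if n - c < k:
        return 1.0  # fewer than k failures, so some passing sample is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 20 samples per problem, with 5, 0, and 20 passing.
problems = [(20, 5), (20, 0), (20, 20)]
score = sum(pass_at_k(n, c, k=1) for n, c in problems) / len(problems)
print(f"pass@1 = {score:.3f}")  # average of the per-problem estimates
```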


Table 2: Comparison of Jamba with other publicly available models. Jamba obtains similar or better performance with much better throughput.


Table 2 compares Jamba to several publicly available models on common academic benchmarks for evaluating language models. We compare with Llama-2 13B [45], which has about the same number of active parameters as our model; Llama-2 70B, which is larger than our model; Gemma [44], which has 7B parameters; and Mixtral [23], which has about the same number of active and total parameters as our model.


Noticeably, Jamba performs comparably to the leading publicly available models of similar or larger size, including Llama-2 70B and Mixtral. At the same time, our model has a smaller number of total available parameters than Llama-2 (52B compared to 70B). Moreover, as a sparse model, Jamba has only 12B active parameters, similar to Mixtral’s 12.9B active parameters. However, as a fully-attentional model, Mixtral has a large memory footprint with long sequences, requiring 32GB for the KV cache with 256K tokens. In contrast, thanks to its hybrid Attention-Mamba architecture, Jamba’s KV cache takes only 4GB even at such a long context (Section 2). Importantly, Jamba achieves this strong performance while having much better throughput than Llama-2 70B and Mixtral, up to a 3x improvement (Section 3.2).
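
The 32GB-versus-4GB gap follows directly from how few of Jamba’s layers keep a key-value cache: Mamba layers carry a constant-size state instead of a cache that grows with the sequence. The back-of-the-envelope sketch below uses assumed hyperparameters (8 KV heads with grouped-query attention, head dimension 128, 16-bit cache entries), chosen for illustration rather than taken from either model card; only the attention-layer count differs between the two configurations.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Size of a standard KV cache: keys and values for every attention layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

CTX = 256 * 1024  # 256K tokens

# Assumed hyperparameters, for illustration only.
mixtral_like = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=CTX)
jamba_like   = kv_cache_bytes(n_attn_layers=4,  n_kv_heads=8, head_dim=128, seq_len=CTX)

print(f"Fully-attentional (32 attention layers): {mixtral_like / 2**30:.1f} GiB")  # ~32 GiB
print(f"Hybrid (4 attention layers):             {jamba_like / 2**30:.1f} GiB")    # ~4 GiB
```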


In summary, Jamba demonstrates the ability of hybrid architectures to reach the performance of state-of-the-art Transformer-based models of the same size class, while retaining the benefits of an SSM.

5.2 Long-Context Evaluations

We have successfully trained Jamba models with context lengths of up to 1M tokens. The released model handles context lengths of up to 256K tokens. In this section, we evaluate it on synthetic and naturalistic benchmarks that test its long-context capabilities.


5.2.1 Needle-in-a-haystack


As Figure 4 shows, Jamba has excellent performance in the needle-in-a-haystack evaluation, which requires retrieving a simple statement planted in a long context window [24]. This result is noteworthy especially given that our implementation of Jamba uses only 4 attention layers.
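
For readers unfamiliar with the setup, a needle-in-a-haystack prompt is a long stretch of filler text with a single factual statement (the "needle") inserted at a controlled depth, followed by a question that can only be answered by retrieving that statement. A minimal sketch of how such a prompt is built (the filler, needle, and question here are made up for illustration and are not the ones used in [24]):

```python
def build_haystack_prompt(filler_paragraph: str, needle: str, question: str,
                          total_paragraphs: int, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of a long
    filler context, then append a question answerable only from the needle."""
    position = int(depth * total_paragraphs)
    paragraphs = [filler_paragraph] * total_paragraphs
    paragraphs.insert(position, needle)
    context = "\n\n".join(paragraphs)
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical example; real evaluations sweep depth and total context length,
# then score whether the model's answer contains the planted fact.
prompt = build_haystack_prompt(
    filler_paragraph="The quick brown fox jumps over the lazy dog. " * 20,
    needle="The secret passcode for the vault is 7421.",
    question="What is the secret passcode for the vault?",
    total_paragraphs=500,
    depth=0.5,
)
```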


5.2.2 Naturalistic long-context evaluation


We evaluate Jamba’s ability to handle long contexts using question-answering benchmarks with long inputs. To this end, we repurpose five of the longest-context datasets from L-Eval [2] by structuring them in a few-shot format (we use 3 shots in all experiments here). Specifically, we evaluated the models on the following datasets: NarrativeQA (QA on narratives; [25]), LongFQA (finance; [2]), Natural Questions (NQ; Wikipedia; [26]), CUAD (law; [21]), and SFiction (science fiction). The average input length in these datasets ranges from 6K to 62K tokens, and the few-shot format expands these context lengths considerably further.
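
Concretely, the few-shot structuring means each prompt concatenates a few complete (document, question, answer) exemplars before the test document, so with 3 shots the effective context grows to roughly four times the raw input length. A rough sketch of such a prompt builder (the field labels and separator are assumptions, not the paper’s exact template):

```python
def few_shot_qa_prompt(exemplars, test_doc, test_question, n_shots=3):
    """Concatenate n_shots full (document, question, answer) exemplars,
    then the test document and question with the answer left blank."""
    blocks = []
    for doc, question, answer in exemplars[:n_shots]:
        blocks.append(f"Document:\n{doc}\n\nQuestion: {question}\nAnswer: {answer}")
    blocks.append(f"Document:\n{test_doc}\n\nQuestion: {test_question}\nAnswer:")
    return "\n\n---\n\n".join(blocks)
```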


Figure 4: A needle-in-a-haystack evaluation showing Jamba’s ability to recall statements placed in the middle of contexts of up to 256K tokens in length.


Table 3 summarizes the evaluation results, in terms of F1. Jamba outperforms Mixtral on most of the datasets as well as on average. In addition, as these long-context tasks require substantial computation, Jamba’s efficiency shines here, with much better throughput at long contexts (Section 3.2).


Table 3: Results (F1) on long-context QA benchmarks, with a 3-shot format.
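
The F1 reported in Table 3 is the standard token-overlap score used for extractive QA: precision and recall are computed over the multiset of tokens shared by the prediction and the gold answer. A minimal sketch (whitespace tokenization only; production harnesses typically also strip punctuation and articles):

```python
from collections import Counter

def qa_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("the treaty was signed in 1848", "1848"))  # ~0.286
```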


This paper is available on arXiv under the CC BY-SA 4.0 DEED license.

