Authors:
(1) Opher Lieber (equal contribution); (2) Barak Lenz (equal contribution); (3) Hofit Bata; (4) Gal Cohen; (5) Jhonathan Osin; (6) Itay Dalmedigos; (7) Erez Safahi; (8) Shaked Meirom; (9) Yonatan Belinkov; (10) Shai Shalev-Shwartz; (11) Omri Abend; (12) Raz Alon; (13) Tomer Asida; (14) Amir Bergman; (15) Roman Glozman; (16) Michael Gokhman; (17) Avashalom Manevich; (18) Nir Ratner; (19) Noam Rozen; (20) Erez Shwartz; (21) Mor Zusman; (22) Yoav Shoham.
6.2 Why does the Combination Work?
The pure Mamba model showed fairly good results in most tasks early on, including in general perplexity evaluations. However, it performed substantially worse than the pure Attention model in three common benchmark tasks: IMDB [28], QuAC [5], and NarrativeQA [25]. In contrast, the hybrid Attention-Mamba performed similarly to the Attention model on these datasets. Table 6 shows the results for 1.3B models after 250B tokens.
Looking into these results further, we found that the pure Mamba model often does not follow the correct format. For instance, in the IMDB dataset, answer choices are “Positive” or “Negative”. While the Attention model adheres to this format, the pure Mamba model often produces other answers, such as “Very Good”, “Very Positive”, “Funny”, “Bad”, “Poor”, and “3/10”. While these may be considered correct answers, Mamba’s difficulty in adhering to the format suggests a potential problem. Indeed, to perform successful in-context learning, it is important for models to capture the input-output format [30]. The hybrid Attention-Mamba model follows the format successfully, just like the pure Attention model.
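To make the format-following requirement concrete, the following is a minimal Python sketch of a few-shot IMDB-style prompt; the template and the build_imdb_prompt helper are illustrative assumptions, not the evaluation prompt used in our experiments.

```python
# Illustrative few-shot prompt for IMDB-style sentiment classification.
# The template and helper name are hypothetical, not the paper's evaluation prompt.
def build_imdb_prompt(shots, query):
    """Concatenate (review, label) demonstrations and a query; labels are 'Positive'/'Negative'."""
    demos = "\n\n".join(f"Review: {text}\nSentiment: {label}" for text, label in shots)
    return f"{demos}\n\nReview: {query}\nSentiment:"

prompt = build_imdb_prompt(
    [("A moving, beautifully acted film.", "Positive"),
     ("Dull plot and wooden dialogue.", "Negative")],
    "I could not stop smiling; highly recommended.",
)
# A model that captures the input-output format should continue with
# 'Positive' or 'Negative'; the pure Mamba model often produced off-format
# answers such as 'Very Good' or '3/10'.
print(prompt)
```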
We hypothesize that this phenomenon points to a limitation of SSMs: a potential difficulty with in-context learning (ICL). Indeed, the ability to perform ICL has been linked to the emergence of so-called induction heads in Transformer language models during training, which perform approximate copying operations that are supportive of ICL [31]. We conjecture that the lack of an attention mechanism in the pure Mamba model makes it difficult for it to learn in-context. While Mamba may learn to copy and perform simple ICL when explicitly trained to do so [16, 32], it is not clear whether ICL is an emergent capability in SSMs, as it typically is in Transformer models. In contrast, the hybrid Attention-Mamba model does perform successful ICL, even when only 1 out of 8 layers is an Attention layer.
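As a rough illustration of what an induction head does, the sketch below scores an attention head by how much attention each position pays to the token that immediately followed an earlier occurrence of the current token, in the spirit of the induction-head diagnostics of [31]. The function is our own illustrative code, not part of the paper's analysis.

```python
import torch

def induction_score(attn, tokens):
    """Rough per-head induction score for a single sequence.

    attn:   [heads, seq, seq] attention weights from one attention layer.
    tokens: [seq] token ids of the same sequence.
    For each position t, measure attention to positions immediately after
    earlier occurrences of tokens[t] -- the "approximate copying" pattern.
    """
    heads, seq, _ = attn.shape
    score = torch.zeros(heads)
    counted = 0
    for t in range(1, seq):
        prev = (tokens[:t] == tokens[t]).nonzero().flatten()
        targets = prev + 1              # token following each earlier occurrence
        targets = targets[targets < t]  # keep only causally valid positions
        if len(targets) == 0:
            continue
        score += attn[:, t, targets].sum(dim=-1)
        counted += 1
    return score / max(counted, 1)  # heads with high scores behave induction-like
```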
As anecdotal evidence of an emergent induction mechanism, we visualize in Figure 7 the attention of an example head from a 1.3B Attention-Mamba hybrid model (no MoE), on an IMDB example where the pure Mamba failed and the hybrid succeeded. Clearly, the attention from the last token (“:”) is focused on the labels from the few-shot examples. We have found 12 such heads in our hybrid model, in all three attention layers (which correspond to layers 4, 12, 20 in the model).
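A minimal sketch of the kind of inspection behind Figure 7 is given below, assuming a HuggingFace-style interface that can return per-layer attention weights via output_attentions=True. The checkpoint path is a placeholder, and in a hybrid model only the attention layers yield such weights; the exact layout of the returned attentions is implementation-specific.

```python
# Inspect where the final ":" token attends, assuming a HuggingFace-style model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/attention-mamba-hybrid"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Review: A moving, beautifully acted film.\nSentiment: Positive\n\n"
    "Review: Dull plot and wooden dialogue.\nSentiment: Negative\n\n"
    "Review: I could not stop smiling; highly recommended.\nSentiment:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: per-attention-layer tensors of shape [batch, heads, seq, seq].
attn = out.attentions[-1][0]   # last attention layer, first batch element
from_last = attn[:, -1, :]     # attention from the final ":" token

# For an induction-like head, the top-attended positions should be the
# few-shot label tokens ("Positive" / "Negative").
top = from_last.mean(dim=0).topk(5)
print([tokenizer.decode(inputs["input_ids"][0, i]) for i in top.indices])
```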
Future work can further investigate the emergence of ICL in hybrid models at larger scale. We hope that our released checkpoints will facilitate such investigations. Finally, very recent work has attempted to extract attention-like scores from state-space models such as Mamba [1], which opens another direction to search for induction capabilities in state-space models.
This paper is available on arxiv under CC BY-SA 4.0 DEED license.