Authors:
(1) Raphaël Millière, Department of Philosophy, Macquarie University ([email protected]);
(2) Cameron Buckner, Department of Philosophy, University of Houston ([email protected]).
Table of Links
2. A primer on LLMs
3. Interface with classic philosophical issues
3.2. Nativism and language acquisition
3.3. Language understanding and grounding
3.5. Transmission of cultural knowledge and linguistic scaffolding
4. Conclusion, Glossary, and References
4. Conclusion
We began this review article by considering the skeptical concern that LLMs are merely sophisticated mimics that memorize and regurgitate linguistic patterns from their training data, akin to the Blockhead thought experiment. Taking this position as a null hypothesis, we critically examined the evidence that could be adduced to reject it. Our analysis revealed that the advanced capabilities of state-of-the-art LLMs challenge many of the traditional critiques aimed at artificial neural networks as potential models of human language and cognition. In many cases, LLMs vastly exceed predictions about the performance upper bounds of non-classical systems. At the same time, however, we found that moving beyond the Blockhead analogy continues to depend upon careful scrutiny of the learning process and internal mechanisms of LLMs, which we are only beginning to understand. In particular, we need to understand what LLMs represent about the sentences they produce, and about the world those sentences describe. Such an understanding cannot be reached through armchair speculation alone; it calls for careful empirical investigation. We need a new generation of experimental methods to probe the behavior and internal organization of LLMs. We will explore these methods, their conceptual foundations, and new issues raised by the latest evolution of LLMs in Part II.
Glossary
Blockhead A philosophical thought experiment introduced by Block (1981), illustrating a hypothetical system that mimics human-like responses without genuine understanding or intelligence. Blockhead's responses are preprogrammed, allowing it to answer any conceivable question by retrieval from an extensive database, akin to a hash table lookup. This system challenges traditional notions of intelligence by exhibiting behavior indistinguishable from a human's while lacking the internal cognitive processes typically associated with intelligence. Blockhead serves as a critical example in discussions about the nature of artificial intelligence, emphasizing the distinction between mere behavioral mimicry and the presence of complex, internal information processing mechanisms as a hallmark of true intelligence. 2, 3, 10, 18, 20
generalization The ability of a neural network model to perform accurately on new, unseen data that is similar but not identical to the data it was trained on. This concept is central to evaluating the effectiveness of a model, as it indicates the extent to which the learned patterns and knowledge can be applied beyond the specific examples in the training dataset. A model that generalizes well maintains high performance when faced with new and varied inputs, demonstrating its adaptability and robustness across a broad range of scenarios. 3, 11–14, 20, 22
logit In the context of Transformer-based LLMs, a logit is the raw output of the model's final layer before it undergoes a softmax transformation to become a probability distribution. Each logit corresponds to a potential output token (e.g., a word or subword unit), and its value indicates the model's preliminary assessment of how likely that token is to be the next element in the sequence, given the input. The softmax function then converts these logits into a probability distribution, from which the model selects the most likely next token during text generation. 7
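The logit-to-probability step described above can be sketched in a few lines of Python; the three logit values here are hypothetical, standing in for a real model's vocabulary-sized output.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution.
    Subtracting the max logit first improves numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(probs)  # probabilities sum to 1; the highest logit gets the highest probability
```

In practice the logit vector has one entry per token in the model's vocabulary (tens of thousands of entries), but the transformation is the same.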
out-of-distribution (OOD) data In machine learning, OOD data refers to input data that significantly differs from the data the model was trained on. This type of data falls outside the distribution of the training dataset, presenting patterns, features, or characteristics that the model has not encountered during its training phase. OOD data is a critical concept because it challenges the model's ability to generalize and maintain accuracy. Handling OOD data effectively is important for robustness and reliability, especially in real-world applications where the model is likely to encounter a wide variety of inputs. 20
self-attention A mechanism within Transformer-based neural networks that enables them to weigh and integrate information from different positions within the input sequence. In the context of LLMs, self-attention allows each token in a sentence to be processed in relation to every other token, facilitating the understanding of context and relationships within the text. This process involves calculating attention scores that reflect the relevance of each part of the input to every other part, thereby enhancing the model's ability to capture dependencies, regardless of their distance in the sequence. This feature is key to LLMs' ability to handle long-range dependencies and complex linguistic structures effectively. 5–7, 22
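A minimal sketch of the score-then-average computation described above, using plain Python lists for a toy two-token sequence (real implementations operate on learned query, key, and value projections of the token embeddings, which are assumed here as given):

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors.
    For each query, compute a score against every key (dot product
    scaled by sqrt(d)), softmax the scores into weights, and return
    the weighted average of the value vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy 2-token sequence with 2-dimensional queries, keys, and values
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print(out)  # each row is a weighted mixture of the rows of V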
tokenization The process of breaking down text into smaller units, called tokens. These tokens can be words, subwords, characters, or other meaningful elements, depending on the granularity of the tokenization algorithm. The purpose of tokenization is to transform the raw text into a format that can be easily processed and understood by a language model. This step is crucial for preparing input data, as it directly affects the model's ability to analyze and generate language. Tokenization plays a fundamental role in determining the level of detail and complexity a model can capture from the text, but can also have a downstream impact on the model's performance with certain tasks such as arithmetic. 6, 22
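As an illustration of subword tokenization, here is a deliberately simplified greedy longest-match tokenizer with a tiny hand-picked vocabulary; real LLM tokenizers (e.g., BPE or WordPiece variants) learn their vocabularies from data, but the input/output shape is the same.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization: at each position,
    take the longest vocabulary entry that matches, falling back to
    single characters for unknown material."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character as fallback
            i += 1
    return tokens

# Hypothetical toy vocabulary for illustration
vocab = {"un", "believ", "able", "token", "ization"}
print(tokenize("unbelievable", vocab))  # ['un', 'believ', 'able']
```

Note how a single word becomes several tokens; this granularity is one reason tokenization can affect downstream performance on tasks like arithmetic, where digits may be split unevenly.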
train-test split In machine learning, the train-test split is a method used to evaluate the performance of a model. It involves dividing the available data into two distinct sets: a training set and a test set. The training set is used to train the model, allowing it to learn and adapt to patterns within the data. The test set, which consists of data not seen by the model during its training, is used to assess the model's performance and generalization capabilities. This split is crucial for providing an unbiased evaluation of the model, as it demonstrates how the model is likely to perform on new, unseen data. 11
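A minimal sketch of the split itself, assuming a simple shuffled 80/20 partition (library helpers such as scikit-learn's `train_test_split` do the same job with more options):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle indices and partition the data into disjoint
    train and test sets."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_fraction)
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

The key property is disjointness: no example used to fit the model appears in the set used to evaluate it.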
Transformer A type of neural network architecture introduced by Vaswani et al. (2017), predominantly used for processing sequential data such as text. It is characterized by its reliance on self-attention mechanisms, which enable it to weigh the importance of different parts of the input data. Unlike earlier architectures, Transformers do not require sequential data to be processed in order, allowing for more parallel processing and efficiency in handling long-range dependencies in data. This architecture forms the basis of most LLMs, known for its effectiveness in capturing complex linguistic patterns and relationships. 1, 5–7, 10–12, 19, 21
vector Mathematically, a vector is an ordered array of numbers, which can represent points in a multidimensional space. In the context of LLMs, vectors are used to represent tokens, where each token can map onto a word or part of a word depending on the tokenization scheme. These vectors, known as embeddings, encode the linguistic features and relationships of the tokens in a high-dimensional space. By converting tokens into vectors, LLMs are able to process and generate language based on the semantic and syntactic properties encapsulated in these numerical representations. 3–5, 7, 14–16, 22
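One standard way such numerical representations are compared is cosine similarity; the tiny 4-dimensional embeddings below are hypothetical (real models learn embeddings with hundreds or thousands of dimensions), but they illustrate how geometric proximity can encode semantic relatedness.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: a standard measure
    of similarity between token embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen so that related words are nearby
cat = [0.9, 0.8, 0.1, 0.0]
dog = [0.8, 0.9, 0.2, 0.1]
car = [0.1, 0.0, 0.9, 0.8]
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```
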
References
Aiyappa, R., An, J., Kwak, H. & Ahn, Y.-Y. (2023), 'Can we trust the evaluation on ChatGPT?'.
Akyürek, E., Akyürek, A. F. & Andreas, J. (2020), Learning to Recombine and Resample Data For Compositional Generalization, in 'International Conference on Learning Representations'.
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J. L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Bińkowski, M., Barreira, R., Vinyals, O., Zisserman, A. & Simonyan, K. (2022), 'Flamingo: A Visual Language Model for Few-Shot Learning', Advances in Neural Information Processing Systems 35, 23716–23736.
Andreas, J. (2020), Good-Enough Compositional Data Augmentation, in 'Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics', Association for Computational Linguistics, Online, pp. 7556–7566.
Andreas, J. (2022), Language Models as Agent Models, in 'Findings of the Association for Computational Linguistics: EMNLP 2022', Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, pp. 5769–5779.
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., Chu, E., Clark, J. H., Shafey, L. E., Huang, Y., Meier-Hellstern, K., Mishra, G., Moreira, E., Omernick, M., Robinson, K., Ruder, S., Tay, Y., Xiao, K., Xu, Y., Zhang, Y., Abrego, G. H., Ahn, J., Austin, J., Barham, P., Botha, J., Bradbury, J., Brahma, S., Brooks, K., Catasta, M., Cheng, Y., Cherry, C., Choquette-Choo, C. A., Chowdhery, A., Crepy, C., Dave, S., Dehghani, M., Dev, S., Devlin, J., Díaz, M., Du, N., Dyer, E., Feinberg, V., Feng, F., Fienber, V., Freitag, M., Garcia, X., Gehrmann, S., Gonzalez, L., Gur-Ari, G., Hand, S., Hashemi, H., Hou, L., Howland, J., Hu, A., Hui, J., Hurwitz, J., Isard, M., Ittycheriah, A., Jagielski, M., Jia, W., Kenealy, K., Krikun, M., Kudugunta, S., Lan, C., Lee, K., Lee, B., Li, E., Li, M., Li, W., Li, Y., Li, J., Lim, H., Lin, H., Liu, Z., Liu, F., Maggioni, M., Mahendru, A., Maynez, J., Misra, V., Moussalem, M., Nado, Z., Nham, J., Ni, E., Nystrom, A., Parrish, A., Pellat, M., Polacek, M., Polozov, A., Pope, R., Qiao, S., Reif, E., Richter, B., Riley, P., Ros, A. C., Roy, A., Saeta, B., Samuel, R., Shelby, R., Slone, A., Smilkov, D., So, D. R., Sohn, D., Tokumine, S., Valter, D., Vasudevan, V., Vodrahalli, K., Wang, X., Wang, P., Wang, Z., Wang, T., Wieting, J., Wu, Y., Xu, K., Xu, Y., Xue, L., Yin, P., Yu, J., Zhang, Q., Zheng, S., Zheng, C., Zhou, W., Zhou, D., Petrov, S. & Wu, Y. (2023), 'PaLM 2 Technical Report'.
Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C. & Kaplan, J. (2021), 'A General Language Assistant as a Laboratory for Alignment'.
Auersperg, A. M. I. & von Bayern, A. M. P. (2019), 'Who's a clever bird – now? A brief history of parrot cognition', Behaviour 156(5-8), 391–407.
Baier, A. C. (2002), Hume: The Reflective Women's Epistemologist?, in 'A Mind Of One's Own', 2 edn, Routledge.
Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. (2021), On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, in 'Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency', FAccT '21, Association for Computing Machinery, New York, NY, USA, pp. 610–623.
Bender, E. M. & Koller, A. (2020), Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data, in 'Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics', Association for Computational Linguistics, Online, pp. 5185–5198.
Bengio, Y., Ducharme, R. & Vincent, P. (2000), A Neural Probabilistic Language Model, in 'Advances in Neural Information Processing Systems', Vol. 13, MIT Press.
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y. et al. (2023), 'Improving image generation with better captions', Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf
Block, N. (1981), 'Psychologism and Behaviorism', The Philosophical Review 90(1), 5–43.
Block, N. (1986), 'Advertisement for a Semantics for Psychology', Midwest Studies in Philosophy 10, 615–678.
Boleda, G. (2020), 'Distributional Semantics and Linguistic Theory', Annual Review of Linguistics 6(1), 213–234.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. & Amodei, D. (2020), 'Language Models are Few-Shot Learners', arXiv:2005.14165 [cs].
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T. & Zhang, Y. (2023), 'Sparks of Artificial General Intelligence: Early experiments with GPT-4'.
Buckner, C. (2017), Understanding Associative and Cognitive Explanations in Comparative Psychology, in 'The Routledge Handbook of Philosophy of Animal Minds', Routledge.
Buckner, C. (2021), 'Black Boxes or Unflattering Mirrors? Comparative Bias in the Science of Machine Behaviour', The British Journal for the Philosophy of Science pp. 000–000.
Buckner, C. J. (2023), From Deep Learning to Rational Machines: What the History of Philosophy Can Teach Us about the Future of Artificial Intelligence, Oxford University Press, Oxford, New York.
Butlin, P. (2021), 'Sharing Our Concepts with Machines', Erkenntnis.
Carnie, A. (2021), Syntax: A Generative Introduction, John Wiley & Sons.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014), 'Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation'.
Chollet, F. (2019), 'On the Measure of Intelligence'.
Chomsky, N. (1957), Syntactic Structures, Mouton.
Chomsky, N. (2000), Knowledge of Language: Its Nature, Origin and Use, in R. J. Stainton, ed., 'Perspectives in the Philosophy of Language: A Concise Anthology', Broadview Press, p. 3.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S. & Amodei, D. (2017), Deep Reinforcement Learning from Human Preferences, in 'Advances in Neural Information Processing Systems', Vol. 30, Curran Associates, Inc.
Conklin, H., Wang, B., Smith, K. & Titov, I. (2021), Meta-Learning to Compositionally Generalize, in 'Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)', Association for Computational Linguistics, Online, pp. 3322–3335.
Csordás, R., Irie, K. & Schmidhuber, J. (2022), CTL++: Evaluating Generalization on Never-Seen Compositional Patterns of Known Functions, and Compatibility of Neural Representations, in 'Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing', Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, pp. 9758–9767.
Dąbrowska, E. (2015), 'What exactly is Universal Grammar, and has anyone seen it?', Frontiers in Psychology 6.
Firth, J. R. (1957), 'A synopsis of linguistic theory, 1930–1955', Studies in linguistic analysis.
Fodor, J. A. (1975), The Language of Thought, Harvard University Press.
Fodor, J. A. & Pylyshyn, Z. W. (1988), 'Connectionism and cognitive architecture: A critical analysis', Cognition 28(1), 3–71.
Grand, G., Blank, I. A., Pereira, F. & Fedorenko, E. (2022), 'Semantic projection recovers rich human knowledge of multiple object features from word embeddings', Nature Human Behaviour 6(7), 975–987.
Grynbaum, M. M. & Mac, R. (2023), 'The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work', The New York Times.
Ha, D. & Schmidhuber, J. (2018), 'World Models'.
Harnad, S. (1990), 'The symbol grounding problem', Physica D: Nonlinear Phenomena 42(1), 335–346.
Harris, Z. S. (1954), 'Distributional structure', Word 10, 146–162.
He, Z., Xie, Z., Jha, R., Steck, H., Liang, D., Feng, Y., Majumder, B. P., Kallus, N. & Mcauley, J. (2023), Large Language Models as Zero-Shot Conversational Recommenders, in 'Proceedings of the 32nd ACM International Conference on Information and Knowledge Management', CIKM '23, Association for Computing Machinery, New York, NY, USA, pp. 720–730.
Herbold, S., Hautli-Janisz, A., Heuer, U., Kikteva, Z. & Trautsch, A. (2023), 'A large-scale comparison of human-written versus ChatGPT-generated essays', Scientific Reports 13(1), 18617.
Hochreiter, S. & Schmidhuber, J. (1997), 'Long Short-Term Memory', Neural Computation 9(8), 1735–1780.
Huebner, P. A., Sulem, E., Cynthia, F. & Roth, D. (2021), BabyBERTa: Learning More Grammar With Small-Scale Child-Directed Language, in A. Bisazza & O. Abend, eds, 'Proceedings of the 25th Conference on Computational Natural Language Learning', Association for Computational Linguistics, Online, pp. 624–646.
Hume, D. (1978), A Treatise of Human Nature, 2nd edn, Oxford University Press, Oxford.
Hupkes, D., Giulianelli, M., Dankers, V., Artetxe, M., Elazar, Y., Pimentel, T., Christodoulopoulos, C., Lasri, K., Saphra, N., Sinclair, A., Ulmer, D., Schottmann, F., Batsuren, K., Sun, K., Sinha, K., Khalatbari, L., Ryskina, M., Frieske, R., Cotterell, R. & Jin, Z. (2023), 'A taxonomy and review of generalization research in NLP', Nature Machine Intelligence 5(10), 1161–1174.
Jelinek, F. (1998), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA, USA.
Jones, C. & Bergen, B. (2023), 'Does GPT-4 Pass the Turing Test?'.
Karhade, M. (2023), 'GPT-4: 8 Models in One; The Secret is Out'.
Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., Stadler, M., Weller, J., Kuhn, J. & Kasneci, G. (2023), 'ChatGPT for good? On opportunities and challenges of large language models for education', Learning and Individual Differences 103, 102274.
Keysers, D., Schärli, N., Scales, N., Buisman, H., Furrer, D., Kashubin, S., Momchev, N., Sinopalnikov, D., Stafiniak, L., Tihon, T., Tsarkov, D., Wang, X., van Zee, M. & Bousquet, O. (2019), Measuring Compositional Generalization: A Comprehensive Method on Realistic Data, in 'International Conference on Learning Representations'.
Kheiri, K. & Karimi, H. (2023), 'SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning'.
Kim, N. & Linzen, T. (2020), COGS: A Compositional Generalization Challenge Based on Semantic Interpretation, in 'Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)', Association for Computational Linguistics, Online, pp. 9087–9105.
Kripke, S. (1980), Naming and Necessity, Harvard University Press, Cambridge, MA.
Lake, B. & Baroni, M. (2018), Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks, in 'Proceedings of the 35th International Conference on Machine Learning', PMLR, pp. 2873–2882.
Lake, B. M. & Baroni, M. (2023), 'Human-like systematic generalization through a meta-learning neural network', Nature pp. 1–7.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. (2017), 'Building machines that learn and think like people', Behavioral and Brain Sciences 40.
Lasnik, H. & Lohndal, T. (2010), 'Government–binding/principles and parameters theory', WIREs Cognitive Science 1(1), 40–50.
Lavechin, M., Sy, Y., Titeux, H., Blandón, M. A. C., Räsänen, O., Bredin, H., Dupoux, E. & Cristia, A. (2023), 'BabySLM: Language-acquisition-friendly benchmark of self-supervised spoken language models'.
LeCun, Y. (n.d.), 'A Path Towards Autonomous Machine Intelligence'.
Lee, N., Sreenivasan, K., Lee, J., Lee, K. & Papailiopoulos, D. (2023), Teaching Arithmetic to Small Transformers, in 'The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS'23'.
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G. & Misra, V. (2022), 'Solving Quantitative Reasoning Problems with Language Models'.
Liang, W., Zhang, Y., Cao, H., Wang, B., Ding, D., Yang, X., Vodrahalli, K., He, S., Smith, D., Yin, Y., McFarland, D. & Zou, J. (2023), 'Can large language models provide useful feedback on research papers? A large-scale empirical analysis'.
Long, B., Goodin, S., Kachergis, G., Marchman, V. A., Radwan, S. F., Sparks, R. Z., Xiang, V., Zhuang, C., Hsu, O., Newman, B., Yamins, D. L. K. & Frank, M. C. (2023), 'The BabyView camera: Designing a new head-mounted camera to capture children's early social and visual environments', Behavior Research Methods.
Macdonald, C. (1995), Classicism Vs. Connectionism, in C. Macdonald & G. F. Macdonald, eds, 'Connectionism: Debates on Psychological Explanation', Blackwell.
Mandelkern, M. & Linzen, T. (2023), 'Do Language Models Refer?'.
Marconi, D. (1997), Lexical Competence, MIT Press.
McCoy, R. T., Yao, S., Friedman, D., Hardy, M. & Griffiths, T. L. (2023), 'Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve'.
McGrath, S., Russin, J., Pavlick, E. & Feiman, R. (2023), 'Properties of LoTs: The footprints or the bear itself?'.
Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013), 'Efficient Estimation of Word Representations in Vector Space', arXiv:1301.3781 [cs].
Millière, R. (forthcoming), Language Models as Models of Language, in R. Nefdt, G. Dupre & K. H. Jain, eds, 'The Oxford Handbook of the Philosophy of Linguistics', Oxford University Press, Oxford.
Mirchandani, S., Xia, F., Florence, P., Ichter, B., Driess, D., Arenas, M. G., Rao, K., Sadigh, D. & Zeng, A. (2023), 'Large Language Models as General Pattern Machines'.
Mirowski, P., Mathewson, K. W., Pittman, J. & Evans, R. (2023), Co-Writing Screenplays and Theatre Scripts with Language Models: Evaluation by Industry Professionals, in 'Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems', CHI '23, Association for Computing Machinery, New York, NY, USA, pp. 1–34.
Mollo, D. C. & Millière, R. (2023), 'The Vector Grounding Problem'.
Murty, S., Sharma, P., Andreas, J. & Manning, C. D. (2023), 'Grokking of Hierarchical Structure in Vanilla Transformers'.
Ontanon, S., Ainslie, J., Fisher, Z. & Cvicek, V. (2022), Making Transformers Solve Compositional Tasks, in 'Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)', Association for Computational Linguistics, Dublin, Ireland, pp. 3591–3607.
OpenAI (2022), 'Introducing ChatGPT'.
OpenAI (2023a), 'GPT-4 Technical Report'.
OpenAI (2023b), 'GPT-4V(ision) System Card'.
Osgood, C. E. (1952), 'The nature and measurement of meaning', Psychological bulletin 49(3), 197–237.
Pavlick, E. (2023), 'Symbols and grounding in large language models', Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 381(2251), 20220041.
Pearl, L. (2022), 'Poverty of the Stimulus Without Tears', Language Learning and Development 18(4), 415–454.
Piantadosi, S. (2023), 'Modern language models refute Chomsky's approach to language'.
Piantadosi, S. & Hill, F. (2022), 'Meaning without reference in large language models'.
Pinker, S. & Prince, A. (1988), 'On language and connectionism: Analysis of a parallel distributed processing model of language acquisition', Cognition 28(1), 73–193.
Portelance, E. & Jasbi, M. (2023), 'The roles of neural networks in language acquisition'.
Putnam, H. (1975), 'The Meaning of "Meaning"', Minnesota Studies in the Philosophy of Science 7, 131–193.
Qiu, L., Shaw, P., Pasupat, P., Nowak, P., Linzen, T., Sha, F. & Toutanova, K. (2022), Improving Compositional Generalization with Latent Structure and Data Augmentation, in 'Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies', Association for Computational Linguistics, Seattle, United States, pp. 4341–4362.
Quilty-Dunn, J., Porot, N. & Mandelbaum, E. (2022), 'The Best Game in Town: The Re-Emergence of the Language of Thought Hypothesis Across the Cognitive Sciences', Behavioral and Brain Sciences pp. 1–55.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. & Liu, P. J. (2020), 'Exploring the limits of transfer learning with a unified text-to-text transformer', The Journal of Machine Learning Research 21(1), 140:5485–140:5551.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022), 'Hierarchical Text-Conditional Image Generation with CLIP Latents'.
Salton, G., Wong, A. & Yang, C. S. (1975), 'A vector space model for automatic indexing', Communications of the ACM 18(11), 613–620.
Savelka, J., Agarwal, A., An, M., Bogart, C. & Sakr, M. (2023), Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses, in 'Proceedings of the 2023 ACM Conference on International Computing Education Research V.1', pp. 78–92.
Savelka, J., Ashley, K. D., Gray, M. A., Westermann, H. & Xu, H. (2023), Can GPT-4 Support Analysis of Textual Data in Tasks Requiring Highly Specialized Domain Expertise?, in 'Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1', pp. 117–123.
Schmidhuber, J. (1990), Towards Compositional Learning with Dynamic Neural Networks, Inst. für Informatik.
Schut, L., Tomasev, N., McGrath, T., Hassabis, D., Paquet, U. & Kim, B. (2023), 'Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero'.
Searle, J. R. (1980), 'Minds, Brains, and Programs', Behavioral and Brain Sciences 3(3), 417–57.
Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K. & Yao, S. (2023), 'Reflexion: Language Agents with Verbal Reinforcement Learning'.
Smolensky, P. (1988), 'On the proper treatment of connectionism', Behavioral and Brain Sciences 11(1), 1–23.
Smolensky, P. (1989), Connectionism and Constituent Structure, in R. Pfeifer, Z. Schreter, F. Fogelman-Soulié & L. Steels, eds, 'Connectionism in Perspective', Elsevier.
Smolensky, P., McCoy, R., Fernandez, R., Goldrick, M. & Gao, J. (2022a), 'Neurocompositional Computing: From the Central Paradox of Cognition to a New Generation of AI Systems', AI Magazine 43(3), 308–322.
Smolensky, P., McCoy, R. T., Fernandez, R., Goldrick, M. & Gao, J. (2022b), 'Neurocompositional computing in human and machine intelligence: A tutorial'.
Sober, E. (1998), Morgan's canon, in 'The Evolution of Mind', Oxford University Press, New York, NY, US, pp. 224–242.
Sullivan, J., Mei, M., Perfors, A., Wojcik, E. & Frank, M. C. (2021), 'SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded From the Infant's Perspective', Open Mind 5, 20–29.
Tomasello, M. (2009), Constructing a Language, Harvard University Press.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S. & Scialom, T. (2023), 'Llama 2: Open Foundation and Fine-Tuned Chat Models'.
Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., Persson, K. A., Ceder, G. & Jain, A. (2019), 'Unsupervised word embeddings capture latent knowledge from materials science literature', Nature 571(7763), 95–98.
Turing, A. M. (1950), 'Computing Machinery and Intelligence', Mind 59(236), 433–460.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017), Attention is All you Need, in I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan & R. Garnett, eds, 'Advances in Neural Information Processing Systems 30', Curran Associates, Inc., pp. 5998–6008.
Wallace, E., Wang, Y., Li, S., Singh, S. & Gardner, M. (2019), Do NLP Models Know Numbers? Probing Numeracy in Embeddings, in K. Inui, J. Jiang, V. Ng & X. Wan, eds, 'Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)', Association for Computational Linguistics, Hong Kong, China, pp. 5307–5315.
Wang, L., Lyu, C., Ji, T., Zhang, Z., Yu, D., Shi, S. & Tu, Z. (2023), 'Document-Level Machine Translation with Large Language Models'.
Wang, R., Todd, G., Yuan, E., Xiao, Z., Côté, M.-A. & Jansen, P. (2023), 'ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games'.
Warstadt, A. & Bowman, S. R. (2022), What Artificial Neural Networks Can Tell Us about Human Language Acquisition, in 'Algebraic Structures in Natural Language', CRC Press.
Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T. & Cotterell, R. (2023), Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora, in A. Warstadt, A. Mueller, L. Choshen, E. Wilcox, C. Zhuang, J. Ciro, R. Mosquera, B. Paranjabe, A. Williams, T. Linzen & R. Cotterell, eds, 'Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning', Association for Computational Linguistics, Singapore, pp. 1–6.
Weaver, W. (1955), Translation, in W. N. Locke & D. A. Booth, eds, 'Machine Translation of Languages', MIT Press, Boston, MA.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V. & Zhou, D. (2022), 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models', Advances in Neural Information Processing Systems 35, 24824–24837.
Winograd, T. (1971), 'Procedures as a Representation for Data in a Computer Program for Understanding Natural Language'.
Wittgenstein, L. (1953), Philosophical Investigations, Wiley-Blackwell, New York, NY, USA.
Zeng, A., Attarian, M., Ichter, B., Choromanski, K., Wong, A., Welker, S., Tombari, F., Purohit, A., Ryoo, M., Sindhwani, V., Lee, J., Vanhoucke, V. & Florence, P. (2022), 'Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language'.
Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. (2021), 'Understanding deep learning (still) requires rethinking generalization', Communications of the ACM 64(3), 107–115.
Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K. & Hashimoto, T. B. (2023), 'Benchmarking Large Language Models for News Summarization'.
Zhou, A., Wang, K., Lu, Z., Shi, W., Luo, S., Qin, Z., Lu, S., Jia, A., Song, L., Zhan, M. & Li, H. (2023), 'Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification'.
This paper is available on arxiv under CC BY 4.0 DEED license.