Authors:
(1) M. Carrasco, Facultad de Ingeniería, Universidad ORT Uruguay, Montevideo, Uruguay ([email protected]);
(2) F. Mayr, Facultad de Ingeniería, Universidad ORT Uruguay, Montevideo, Uruguay ([email protected]);
(3) S. Yovine, Facultad de Ingeniería, Universidad ORT Uruguay, Montevideo, Uruguay ([email protected]);
(4) J. Kidd, Facultad de Ingeniería, Universidad ORT Uruguay, Montevideo, Uruguay;
(5) M. Iturbide, Facultad de Ingeniería, Universidad ORT Uruguay, Montevideo, Uruguay;
(6) J. da Silva, Facultad de Ingeniería, Universidad ORT Uruguay, Montevideo, Uruguay;
(7) A. Garat, Facultad de Ingeniería, Universidad ORT Uruguay, Montevideo, Uruguay.
Table of Links
3 Learning algorithm
4 Analyzing large language models
5 Conclusions, Acknowledgements, and References
Performance experiments We compare Omit-Zero against two instances of QNT that differ in the behavior of the teacher: Standard uses the Hopcroft-Karp algorithm [3] as EQ and MQ as in [6], while Teacher-Filter checks whether the string queried by MQ traverses a 0-probability transition, in which case it reports it as undefined. Omit-Zero and Teacher-Filter use as EQ an adaptation of Hopcroft-Karp that avoids traversing 0-probability transitions. The comparison is carried out on randomly generated PDFA. First, we construct DFA using the algorithm in [9], which for a given nominal size n generates DFA whose actual reachable size is normally distributed around n. Then, each DFA is transformed into a PDFA by assigning a random probability distribution over Σ$ to every state. A parameter θ controls the probability of a symbol being assigned probability 0.
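For illustration, the state distributions of such a random PDFA could be drawn as in the following sketch (this is our own minimal reading of the construction, not the authors' implementation; function and variable names are ours):

```python
import numpy as np

# Illustrative sketch: assign each state of a randomly generated DFA a random
# distribution over the alphabet plus a terminal symbol, where theta is the
# probability that a given symbol receives probability 0.
def random_state_distributions(num_states, alphabet_size, theta, seed=None):
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(num_states):
        # Decide which of the |Sigma| + 1 symbols (alphabet + terminal) are kept.
        keep = rng.random(alphabet_size + 1) >= theta
        if not keep.any():
            keep[rng.integers(alphabet_size + 1)] = True  # keep at least one symbol
        weights = rng.random(alphabet_size + 1) * keep
        dists.append(weights / weights.sum())  # normalize to a probability distribution
    return dists
```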
Running times as a function of θ. 10 random PDFA with n = 500 and |Σ| = m = 20 were generated for each θ ∈ [0.9, 1), with step 0.02. Each algorithm was run 10 times on every PDFA using quantization equivalence ([6]), adapted to satisfy (3), with parameter κ = 100. Fig. 3(a) shows that Omit-Zero has the best performance, with an almost constant but substantial improvement over Teacher-Filter.
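The quantization equivalence of [6] compares distributions symbol-wise through a partition of [0, 1] into κ intervals; the sketch below illustrates the idea. Treating exact zeros as their own class, so that 0-probability symbols are never merged with small positive probabilities, is our reading of how the adaptation to (3) could work, not necessarily the exact variant used in the experiments:

```python
import numpy as np

def quantize(p, kappa):
    """Map each probability to the index of its quantization interval in [0, 1]."""
    p = np.asarray(p, dtype=float)
    bins = np.minimum(np.floor(p * kappa), kappa - 1).astype(int)
    bins[p == 0.0] = -1  # separate class for exactly-zero probabilities
    return bins

def quant_equivalent(p, q, kappa):
    """Two distributions are equivalent iff every symbol falls in the same interval."""
    return np.array_equal(quantize(p, kappa), quantize(q, kappa))
```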
Running times as a function of n. We compared the performance on 10 random PDFA for each n = 250, 500, 750, 1000, with m = 10, using κ = 10 and θ = 0.9. Each algorithm was run 10 times on each PDFA. Fig. 3(b) shows the median execution time as a function of n. Omit-Zero is always the fastest, achieving speedups of approximately 24x and 3x over Standard and Teacher-Filter, respectively, for n = 1000.
4 Analyzing large language models
Guiding generation Guiding an LLM to generate strings of interest consists in synchronizing it with an automaton that defines the set of symbols that can be drawn at each step of the generation process, possibly constrained further by a sampling strategy. To illustrate how the synchronization works, consider the language model given by the PDFA L in Fig. 4 (0-probabilities are omitted). The guide G is a weighted automaton that defines a mask at each state: a weight of 1 for a symbol means it is allowed, otherwise it is not. L × G is a weighted automaton whose underlying structure is the product automaton, with weights obtained by multiplying the distribution of the state of L by the weights of the state of G. To obtain the PDFA B, we apply the sampling strategy samptop2, as sketched below.
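A minimal sketch of one synchronization step between L and G at a pair of states is given below. The names (guided_step, llm_dist, guide_mask) are ours, the guide's weights are assumed to be 0/1 masks, and samptop2 is assumed to keep the two most probable allowed symbols and renormalize, as the example suggests:

```python
import numpy as np

def guided_step(llm_dist, guide_mask, top_k=2):
    """One guided-generation step: mask the model's next-symbol distribution
    with the guide, renormalize, then keep only the top-k allowed symbols."""
    weights = np.asarray(llm_dist, dtype=float) * np.asarray(guide_mask, dtype=float)
    if weights.sum() == 0:
        raise ValueError("the guide blocks every symbol at this state")
    probs = weights / weights.sum()    # distribution of the product L x G
    keep = np.argsort(probs)[-top_k:]  # indices of the top-k allowed symbols
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()             # distribution of the resulting PDFA B
```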
5 Conclusions
This work was motivated by the need to understand LLM when their operation is controlled by external artifacts, such as grammars, to generate text following a specific format. An important question that arises in this context is how to deal with the 0-probabilities that appear when restricting their output. To begin with, we revised the congruence (1) in order to make constructing the quotient less dependent on P by expressing it in terms of the output of the language model. The first consequence of this operational view is that it allows a generalization of the congruence capable of dealing with equivalences on distributions. Besides, it led to developing a variant of the QNT active-learning algorithm that efficiently learns PDFA by avoiding checks for 0-probability transitions as much as possible. This is essential to make it computationally feasible by reducing the number of queries to the LLM. The experimental results[3] support the viability of our approach for analyzing and validating statistical properties of LLM, such as bias in text generation. Besides, they provide evidence that the distributions resulting from generation with a guided LLM can be well approximated by a learnt PDFA. This opens the door to making these analyses less dependent on sampling by studying properties of the PDFA.
Acknowledgements Research reported in this article has been partially funded by ANII-Agencia Nacional de Investigación e Innovación under grants IA 1 2022 1 173516, FMV 1 2023 1 175864, POS NAC 2023 1 178663, and POS FMV 2023 1 1012218.
References
[1] R. C. Carrasco and J. Oncina. Learning deterministic regular grammars from stochastic samples in polynomial time. RAIRO - Theoretical Informatics and Applications, 33(1):1–19, 1999.
[2] L. Du, L. Torroba Hennigen, T. Pimentel, C. Meister, J. Eisner, and R. Cotterell. A measure-theoretic characterization of tight language models. In 61st ACL, pages 9744–9770, July 2023.
[3] J. E. Hopcroft and R. M. Karp. A linear algorithm for testing equivalence of finite automata. Technical Report, Cornell University, 1971.
[4] I. Khmelnitsky, D. Neider, R. Roy, X. Xie, B. Barbot, B. Bollig, A. Finkel, S. Haddad, M. Leucker, and L. Ye. Property-directed verification and robustness certification of recurrent neural networks. In ATVA, pages 364–380, Cham, 2021. Springer International Publishing.
[5] M. Kuchnik, V. Smith, and G. Amvrosiadis. Validating large language models with ReLM. In MLSys, 2023.
[6] F. Mayr, S. Yovine, M. Carrasco, F. Pan, and F. Vilensky. A congruence-based approach to active automata learning from neural language models. In PMLR, volume 217, pages 250–264, 2023.
[7] F. Mayr, S. Yovine, and R. Visca. Property checking with interpretable error characterization for recurrent neural networks. MAKE, 3(1):205–227, 2021.
[8] E. Muškardin, M. Tappler, and B. K. Aichernig. Testing-based black-box extraction of simple models from RNNs and transformers. In PMLR, volume 217, pages 291–294, 10–13 Jul 2023.
[9] C. Nicaud. Random deterministic automata. In MFCS’14, pages 5–23. LNCS 8634, 2014.
[10] P. C. Shields. The ergodic theory of discrete sample paths. AMS, 1996.
[11] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. C. Carrasco. Probabilistic finite-state machines - Part I. IEEE PAMI, 27(7):1013–1025, 2005.
[12] Q. Wang, K. Zhang, A. G. Ororbia, II, X. Xing, X. Liu, and C. L. Giles. An empirical evaluation of rule extraction from recurrent neural networks. Neural Comput., 30(9):2568–2591, 2018.
[13] G. Weiss, Y. Goldberg, and E. Yahav. Extracting automata from recurrent neural networks using queries and counterexamples. In PMLR, volume 80, 10–15 Jul 2018.
[14] B. T. Willard and R. Louf. Efficient guided generation for LLMs. Preprint arXiv:2307.09702, 2023.
A Proof of Proposition 2.1
Proof. Let u and v in Σ* be arbitrary.
B Proof of Proposition 2.2
C Proof of Proposition 2.3
D Proof of Proposition 2.4
[2] https://huggingface.co/docs/transformers/main_classes/tokenizer
[3] https://github.com/neuralchecker/analyzing_constrained_LLM_through_PDFA_learning