Methods for Decoding Opponent Actions and Optimizing Responses

Written by algorithmicbias | Published 2025/01/24
Tech Story Tags: behavioral-biases | game-theory | zero-sum-games | predicting-opponents | best-responses | halving-algorithm | game-matrices | myopic-best-responder

TL;DR: This section discusses the challenge of predicting opponents' actions and learning best responses in zero-sum games. While algorithms like halving can predict deterministic opponents, learning best responses requires exploiting their specific strategies. An example highlights how playing optimally according to an incorrect matrix can lead to consistent losses, showing that prediction alone isn't sufficient to win.

Authors:

(1) Avrim Blum, Toyota Technological Institute at Chicago, IL, USA;

(2) Melissa Dutz, Toyota Technological Institute at Chicago, IL, USA.

Table of Links

Abstract and 1 Introduction

2 Setting and 2.1 Models of behaviorally-biased opponents

3 Preliminaries and Intuition

4.1 Myopic Best Responder and 4.2 Gambler's Fallacy Opponent

4.3 Win-Stay, Lose-Shift Opponent

4.4 Follow-the-Leader Opponent and 4.5 Highest Average Payoff Opponent

5 Generalizing

5.1 Other Behaviorally-Biased Strategies

5.2 Exploiting an Unknown Strategy from a Known Set of Strategies

6 Future Work and References

A Appendix

A.1 Win-Stay Lose-Shift Variant: Tie-Stay

A.2 Follow-the-Leader Variant: Limited History

A.3 Ellipsoid Mistake Bounds

A.4 Highest Average Payoff Opponent

3. Preliminaries and Intuition

While the strategy employed by the Copycat algorithm in Feldman et al. [8], playing action i against action j roughly as many times as action j against action i (for all i, j), is sufficient to roughly tie in a symmetric zero-sum game, we'd like to go further and win nearly every round. To do this, we will learn to predict the opponent's actions and learn best responses to the actions the opponent plays. It turns out that predicting the opponent's actions is the "easy" part, and the main challenge will be learning best responses. We discuss each of these at a high level below.

Predicting the Opponentā€™s Actions

It is possible to learn to predict any of our opponents' actions using a simple halving algorithm (see [11]). Recall that in our setting, an opponent's choice of action is deterministic and induced by their known biased strategy, their tie-breaking mechanism, the unknown game matrix, and the history of play. Consider all possible combinations of the opponent's known biased strategy with a potential tie-breaking mechanism to be a family of deterministic strategies. Then, consider all possible pairs consisting of a deterministic strategy from that family and a game matrix of the appropriate size. Predict the opponent's next action by taking the majority vote over the predictions corresponding to these pairs and the play history; once the opponent's action is revealed, eliminate all pairs that predicted incorrectly. Each time a mistake is made, at least half of the remaining pairs must have predicted incorrectly, so at least half of the remaining pairs are eliminated. Given a finite number of possible tie-breaking mechanisms, the number of prediction errors is therefore at most logarithmic in the number of pairs, and in particular finite.

However, this algorithm is inefficient: even with our assumption of only 3 possible payoff values, there are roughly 3^(n²) possible game matrices, and for each game matrix there may be many tie-breaking mechanisms to consider. Throughout this paper we generally assume the opponent breaks ties according to some fixed ordering over actions, in which case there are n! of them. We will give more efficient prediction algorithms for each of the opponents introduced above by taking further advantage of their specific strategies.
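To make the version-space idea concrete, here is a minimal Python sketch of this halving scheme. It is illustrative, not the paper's implementation, and it assumes a hypothetical helper opponent_model(matrix, tie_order, history) that returns the action the opponent's known biased strategy would take next under a candidate hypothesis.

```python
# Minimal sketch of the halving predictor (illustrative, not the paper's
# implementation). A hypothesis is a (game matrix, tie-breaking order) pair;
# `opponent_model(matrix, tie_order, history)` is a hypothetical helper that
# returns the action the opponent's known biased strategy would take next
# under that hypothesis, given the play history.
from collections import Counter
from itertools import permutations, product

PAYOFFS = (-1, 0, 1)  # the three possible payoff values assumed in the text

def enumerate_hypotheses(n):
    """All (matrix, tie_order) pairs for an n-action game: ~3^(n^2) * n! of
    them, which is why this generic scheme is inefficient."""
    cells = product(PAYOFFS, repeat=n * n)
    matrices = (tuple(tuple(c[i * n:(i + 1) * n]) for i in range(n))
                for c in cells)
    return [(m, order) for m in matrices for order in permutations(range(n))]

def predict(opponent_model, hypotheses, history):
    """Majority vote over the surviving hypotheses' predictions."""
    votes = Counter(opponent_model(m, o, history) for m, o in hypotheses)
    return votes.most_common(1)[0][0]

def update(opponent_model, hypotheses, history, observed):
    """After the round, keep only hypotheses that predicted correctly; on a
    mistake this discards at least half, so mistakes <= log2(#hypotheses)."""
    return [(m, o) for m, o in hypotheses
            if opponent_model(m, o, history) == observed]
```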

Learning Best Responses

It may seem like being able to predict generally is sufficient to learn best responses, but this is not the case. For example, a general (exponential) algorithm which does not work is the following: find the lexicographically first game matrix and action ordering pair consistent with all the opponent's actions so far and their behavioral strategy, and then play a best response to the action predicted by that pair, according to the corresponding game matrix. However, even for the simplest opponent we consider, the Myopic Best Responder, and even with identical orderings over actions, there can be two game matrices M and M* that are both consistent with the opponent's play, yet playing best responses with respect to M can result in losing most rounds (and tying in the remaining rounds) if the true game matrix is M* (see the example below). Instead, we will more actively exploit the opponent's particular strategy to elicit best responses.
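For concreteness, this failing approach looks like the sketch below, reusing the hypothetical opponent_model helper from the halving sketch and adding an explicit best_response helper; the example that follows shows why consistency alone is not enough.

```python
# Sketch of the flawed approach above (illustrative): commit to the first
# hypothesis consistent with the history and best-respond to its prediction.
# `opponent_model(matrix, tie_order, history)` is the same hypothetical
# helper as in the halving sketch.

def best_response(matrix, their_action):
    """Our payoff-maximizing reply under model `matrix`, whose entry
    matrix[i][j] is (assumed to be) our payoff playing i against j."""
    return max(range(len(matrix)), key=lambda i: matrix[i][their_action])

def consistent(opponent_model, matrix, tie_order, history):
    """Does hypothesis (matrix, tie_order) reproduce every opponent action
    observed so far? `history` is a list of (our_action, opp_action) pairs."""
    return all(opponent_model(matrix, tie_order, history[:t]) == opp
               for t, (_, opp) in enumerate(history))

def play_via_first_consistent(opponent_model, hypotheses, history):
    """Flawed strategy: scan hypotheses in lexicographic order and
    best-respond under the first consistent one, which may be a matrix M
    rather than the true matrix M*."""
    for matrix, tie_order in hypotheses:
        if consistent(opponent_model, matrix, tie_order, history):
            predicted = opponent_model(matrix, tie_order, history)
            return best_response(matrix, predicted)
    raise ValueError("no consistent hypothesis remains")
```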

Example: Playing Optimally According to a Consistent Game Matrix and Action Ordering is Not Sufficient to Win

Action Ordering Ω: R, P, S, R', P', S'

Suppose we are playing a game M* against the Myopic Best Responder, whose ordering over actions is Ω. Suppose that we predict the Myopic Best Responder's actions according to the correct action ordering Ω but a slightly different game matrix M, and play best responses according to M. Both games are like an expanded version of rock-paper-scissors, in which the interactions among R, P, S and among R', P', S' are each the same as in standard rock-paper-scissors, but the interactions between those two sets of actions differ between M and M*; the differing entries are bolded for convenience. Consider the following sequence of play, which is consistent with M*, M, Ω, and the strategy of the Myopic Best Responder, and optimal according to M:

The sequence shown in the first three rounds will repeat indefinitely as long as we continue to play the same best responses with respect to M, leading us to lose or tie in every round.

Note that it is possible to construct very similar, larger games in which we lose to the Myopic Best Responder an arbitrarily large fraction of the time (depending on the number of actions in the game), since we only need to tie against the first action in the opponent's action ordering to avoid discovering any inconsistency.
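Though the matrices M and M* themselves appear as tables in the paper, the dynamic they produce is easy to simulate. The sketch below assumes the Section 4.1 definition of the Myopic Best Responder, which best-responds to our most recent action under the true matrix and breaks ties by its fixed ordering; its first-round behavior (playing its tie-order favorite) is an assumption for illustration.

```python
# Toy simulation harness for play against a Myopic Best Responder
# (illustrative; assumes the opponent best-responds to our previous action
# under the TRUE matrix, breaking ties by its fixed ordering).

def myopic_opponent(true_matrix, tie_order, our_last_action):
    """Opponent's reply: true_matrix[a][x] is (assumed to be) the opponent's
    payoff playing a against our x. max() keeps the earliest maximizer in
    `tie_order`, implementing fixed-ordering tie-breaking."""
    return max(tie_order, key=lambda a: true_matrix[a][our_last_action])

def simulate(rounds, our_policy, true_matrix, tie_order):
    """Play simultaneous rounds; `our_policy(history)` chooses our action
    (e.g., a best response to a prediction made under the wrong matrix M).
    Returns the history as a list of (our_action, opp_action) pairs."""
    history = []
    for _ in range(rounds):
        if history:
            theirs = myopic_opponent(true_matrix, tie_order, history[-1][0])
        else:
            theirs = tie_order[0]  # assumed opening: the tie-order favorite
        ours = our_policy(history)
        history.append((ours, theirs))
    return history
```

With the paper's M and M* plugged in and a policy that best-responds under M, this loop would reproduce the repeating three-round cycle described above.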

This paper is available on arxiv under CC BY 4.0 DEED license.

