This second blog in the series introduces the Nash Equilibrium and shows how Regret Matching optimises toward it. It sets the stage for the final post in the series, where we’ll introduce Counterfactual Regret Minimisation in Texas Hold ’Em, which also optimises toward the Equilibrium and arguably achieves it.
Remi’s mission is the development of neuroscience inspired A.I. We design agents and test their ability in different environments. Taking inspiration from the way humans excel at a wide variety of tasks, from driving, to poker, to writing blog posts, we’re working on agents that can easily and seamlessly adapt from one task to another. One of the key tools in this research is the use of established psychology tests to better understand how the Agents we’re developing act in comparison to humans. As our Agents become more complex and autonomous, psychological tests will become ever more important as a means to understand their behaviour and inner workings.
We wanted to illustrate this idea of psychological tests with a naive example. The Prisoner’s Dilemma is an interesting and simple test that has become synonymous with Game Theory and the Nash Equilibrium. We wanted to test how Regret Matching and another Reinforcement Learning Agent performed in the Prisoner’s Dilemma, and to illustrate the value of such tests in A.I research.
For those who aren’t familiar with the Prisoner’s Dilemma, here is a narrative introduction.
One day the police catch two individuals selling illicit substances on either side of the city.
They arrest a gentleman named Dave, who is caught red-handed in the act of selling narcotics. In a completely separate and unrelated event, the police catch another gentleman named Luke also selling narcotics. They are both brought to the police station and placed in private cells.
The Police Commissioner, interviewing the two gentlemen in their respective cells, impresses on each of them that it is an open-and-shut case: they’ll both be charged with possession with intent to distribute, and each is facing two years in prison.
But as he finishes delivering the news to Luke, the Commissioner begins to realise that these two gentlemen closely resemble the suspects in a major armed robbery committed some weeks earlier. Despite his strong suspicions, the Commissioner has no hard evidence linking them to this earlier crime, only the fact that they roughly match the description.
So, without sufficient evidence to charge them for the robbery, he conceives of a plan to offer them both separate deals that incentivise them to snitch on the other.
The deal he makes to both Dave and Luke is as follows: the 2 years for drug dealing is guaranteed, since they were caught red-handed. But if one confesses to the robbery and the other doesn’t, the confessor will serve only 1 year while the other serves 5. And if they both confess, they each get 3 years.
The story above is the classic Prisoner’s Dilemma. The deal is summarised in the payoff table below.
The Prisoner’s Dilemma Payoff Table (years in prison, shown as Dave / Luke):

                    Luke confesses    Luke denies
  Dave confesses        3 / 3            1 / 5
  Dave denies           5 / 1            2 / 2
The bottom right hand field, where both parties deny, is the Global Optimum: the outcome with the smallest combined prison sentence. Yet it is an unstable position, because neither prisoner can be assured that the other will also choose to deny. This possible defection by the other prisoner leads most humans to abandon the cooperative option and choose to snitch.
We wanted to test how A.I, and specifically Regret Matching, would react in the same environment. Would an Agent show the same preference toward snitching? Such a test is interesting from a Nash Equilibrium point of view, but it also illustrates a useful tool in A.I research.
As Neuroscience Inspired A.I, and Reinforcement Learning especially, continues to advance, it’s fundamental that we begin to incorporate tests from psychology to study whether the Agents we’re building behave as humans and other mammals do. This will help elucidate whether our reward structures and algorithm designs are producing behaviour comparable to humans.
So we took the classic Prisoner’s Dilemma experiment detailed above and ran both Regret Matching and our own QRL Network through it. We found that both Agents have a strong bias toward snitching, like humans: both optimised toward the Nash Equilibrium.
Above is the QRL Network over 1000 iterations of the Prisoner’s Dilemma Experiment showing a strong bias toward snitching.
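For readers who want to see the mechanics, below is a minimal sketch of regret matching applied to the Prisoner’s Dilemma. It is not the exact code from the gists linked at the end of this post: the payoffs encode the prison sentences above as negative rewards, and the names (RegretMatcher, PAYOFF and so on) are purely illustrative.

```python
import random

SNITCH, DENY = 0, 1
ACTIONS = [SNITCH, DENY]

# PAYOFF[my_action][their_action] -> my reward (negative years in prison).
PAYOFF = {
    SNITCH: {SNITCH: -3, DENY: -1},
    DENY:   {SNITCH: -5, DENY: -2},
}

class RegretMatcher:
    def __init__(self):
        self.regret_sum = [0.0, 0.0]
        self.strategy_sum = [0.0, 0.0]

    def strategy(self):
        # Mix actions in proportion to their positive cumulative regret.
        positive = [max(r, 0.0) for r in self.regret_sum]
        total = sum(positive)
        strat = [p / total for p in positive] if total > 0 else [0.5, 0.5]
        self.strategy_sum = [s + t for s, t in zip(self.strategy_sum, strat)]
        return strat

    def act(self):
        return random.choices(ACTIONS, weights=self.strategy())[0]

    def observe(self, my_action, their_action):
        # Regret of an action = what it would have earned minus what we earned.
        earned = PAYOFF[my_action][their_action]
        for a in ACTIONS:
            self.regret_sum[a] += PAYOFF[a][their_action] - earned

    def average_strategy(self):
        total = sum(self.strategy_sum)
        return [s / total for s in self.strategy_sum]

p1, p2 = RegretMatcher(), RegretMatcher()
for _ in range(1000):
    a1, a2 = p1.act(), p2.act()
    p1.observe(a1, a2)
    p2.observe(a2, a1)

print("P1 average strategy [snitch, deny]:", p1.average_strategy())
print("P2 average strategy [snitch, deny]:", p2.average_strategy())
```

After 1000 iterations, both average strategies sit heavily on snitching: in this game, denying only ever accumulates negative regret, so regret matching drives both Agents toward the Nash Equilibrium.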
So what is this Nash Equilibrium? Put simply, a Nash Equilibrium is a situation where the game has reached an impasse: none of the players can reach a better outcome by unilaterally switching strategies. The concept was formalised by the mathematician John Nash.
For example, in a two-player game such as the Prisoner’s Dilemma, a Nash Equilibrium is an outcome where player 2’s strategy is the best response to player 1’s strategy, and player 1’s strategy is the best response to player 2’s strategy.
In the context of this A.I Prisoner’s Dilemma: if Luke snitches, Dave’s best response is to snitch, since 3 years is better than 5. If Luke doesn’t snitch, Dave’s best response is still to snitch, since 1 year is better than 2. Snitching is therefore Dave’s best response whatever Luke does, and by symmetry the same holds for Luke, so both snitching is a Nash equilibrium.
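This best-response reasoning can also be checked mechanically. Below is a small sketch (our own illustrative encoding of the payoff table, not code from the gists) that enumerates every action pair and reports which ones are Nash equilibria; only the pair where both snitch survives.

```python
# Years in prison for each (dave_action, luke_action) pair,
# taken from the payoff table above.
YEARS = {
    ("snitch", "snitch"): (3, 3),
    ("snitch", "deny"):   (1, 5),
    ("deny",   "snitch"): (5, 1),
    ("deny",   "deny"):   (2, 2),
}

def is_nash(dave, luke):
    d_years, l_years = YEARS[(dave, luke)]
    # Nash: neither player can shorten their own sentence by unilaterally deviating.
    dave_ok = all(YEARS[(d, luke)][0] >= d_years for d in ("snitch", "deny"))
    luke_ok = all(YEARS[(dave, l)][1] >= l_years for l in ("snitch", "deny"))
    return dave_ok and luke_ok

for dave in ("snitch", "deny"):
    for luke in ("snitch", "deny"):
        if is_nash(dave, luke):
            print(f"({dave}, {luke}) is a Nash equilibrium")
# Prints only: (snitch, snitch) is a Nash equilibrium
```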
In general, a game can have multiple Nash equilibria, or none at all in pure strategies.
Assuming you’re not currently incarcerated and being offered a deal to snitch on a fellow inmate, it will be helpful to explore a real-world application of the Nash Equilibrium and Reinforcement Learners. The Prisoner’s Dilemma is a trivial, constructed example, but the Nash Equilibrium (and therefore CFR and other A.I Reinforcement Learners) has many real-world applications. Any game or business environment with numerous competing players and various strategies, such as Poker or Paid Search Auctions, is subject to the Nash Equilibrium. We’ve deployed our own adaptations of Counterfactual Regret Minimisation in our Google Adwords Artificial Intelligence Platform, Remi AO, with interesting successes.
For the sake of brevity, we will not detail the full Google Adwords auction mechanism. Instead we’ll explore a simpler First-Price Auction, which still captures much of the behaviour of Paid Search Auctions.
The Auction we discuss here involves hidden bids. More precisely, the bidders simultaneously submit their bids in private, without knowledge of competitor bids.
The object (or, in the Adwords case, the Ad Position) is allocated, in exchange for a payment, to the bidder who made the highest bid. To keep things simple in this example, we assume that when more than one bidder submits the highest bid, the object/Ad Position is allocated to the highest bidder with the lowest index. To formulate the auction as a strategic game, we consider each bidder as a player, just as each prisoner was a player, and view each possible bid of player i as one of his strategies.
In this example, we assume that the valuations vᵢ are fixed and known to all the competitors. In reality, this is rarely the case: different companies may value Ad Positions differently in online Ad Auctions due to different profit margins on the products they’re advertising.
However, this assumption is necessary, since the valuations are used in the payoff tables and therefore the players have an understanding of each others’ payoff functions.
The standard rule in a hidden-bid auction is that the winner i pays the auctioneer the full value of his bid (though again, this is not quite accurate for many Online Ad Auctions). The resulting mechanism is called the first-price auction. Assume the winner is bidder i, whose winning bid is bᵢ. Since bidder i values the object/Ad Position at vᵢ, his payoff (in other words, his profit) is vᵢ − bᵢ. For the other bidders the payoff is 0.
It’s very important to note that in this context and in Online Ad Auctions, the winner’s profit can be negative. This occurs when the Bidder submits a bid that is higher than the valuation of the object/Ad Position being auctioned.
To summarise, the payoff function pᵢ of player i in a first-price auction is defined as follows, where b is the vector of submitted bids:

pᵢ(b) = vᵢ − bᵢ if i is the winner (the lowest-indexed bidder with bᵢ = max b), and pᵢ(b) = 0 otherwise.
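As a concrete illustration, here is a short Python sketch of this payoff function, with ties broken in favour of the lowest index as described above. The names are ours, purely for illustration; this is not the mechanism of any real ad platform.

```python
def winner(bids):
    # Highest bid wins; ties go to the bidder with the lowest index.
    return max(range(len(bids)), key=lambda i: (bids[i], -i))

def payoff(i, valuations, bids):
    # p_i(b) = v_i - b_i if player i wins, and 0 otherwise.
    return valuations[i] - bids[i] if winner(bids) == i else 0.0

valuations = [10.0, 8.0, 6.0]
bids = [8.0, 8.0, 5.0]
for i in range(len(bids)):
    print(f"player {i}: payoff {payoff(i, valuations, bids)}")
# Player 0 wins the tie (lowest index) and earns 10 - 8 = 2; the others earn 0.
```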
Consider the game associated with the first-price auction with the players’ valuations v. Then b is a Nash equilibrium if, for i = argsmax b (the lowest-indexed bidder with the highest bid), the following three conditions hold:

1. bᵢ ≤ vᵢ

Put simply, the winner does not bid too high: the bid is at most the value of the object/Ad Position.

2. maxⱼ≠ᵢ vⱼ ≤ bᵢ

Put simply, the winner submitted a high enough bid to win.

3. bᵢ = maxⱼ≠ᵢ bⱼ

Put simply, another player submitted the same bid as player i.
Conditions 1 and 2 together imply that vᵢ = max v, which simply means that in every Nash equilibrium the bidder with the highest valuation is the winner. This is a fairly obvious but often underestimated observation: the bidder who holds the highest valuation has the most room to bid high, and can therefore win the auction.
Suppose that a vector of bids b in an auction satisfies conditions 1 to 3. Bidder i is the winner, and by condition 1 their profit vᵢ − bᵢ is non-negative. Their profit could only increase if they bid less, but then by condition 3 the player who matched their bid becomes the winner, and player i’s payoff drops to 0, which is no improvement. Conversely, the payoff of any other player j is 0 and can only increase if they bid high enough to become the winner. But then, by condition 2, their payoff would be 0 at best.
So b is a Nash equilibrium.
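These three conditions are easy to test in code. Below is a hypothetical sketch that checks whether a bid vector b is a Nash equilibrium of the first-price auction for given valuations v, repeating the winner helper from the previous snippet so it stands alone.

```python
def winner(bids):
    # Highest bid wins; ties go to the lowest index (i = argsmax b).
    return max(range(len(bids)), key=lambda i: (bids[i], -i))

def is_first_price_nash(v, b):
    i = winner(b)
    cond1 = b[i] <= v[i]                                        # 1. winner does not overbid
    cond2 = max(v[j] for j in range(len(v)) if j != i) <= b[i]  # 2. bid high enough to win
    cond3 = b[i] == max(b[j] for j in range(len(b)) if j != i)  # 3. another bidder matches b_i
    return cond1 and cond2 and cond3

v = [10.0, 8.0, 6.0]
print(is_first_price_nash(v, [8.0, 8.0, 5.0]))  # True: all three conditions hold
print(is_first_price_nash(v, [9.0, 8.0, 5.0]))  # False: the winner could profitably bid less
```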
In the final part of this series we will explore CFR in Poker, in particular in Texas Hold ’Em.
By Shahrad Jamshidi and Alasdair Hamilton
For Part 1 in this series: https://hackernoon.com/artificial-intelligence-poker-and-regret-part-1-36c78d955720
The code for the Prisoner’s Dilemma: https://gist.github.com/shahradj/ebe9160ba3a996142e3b225d0f1ed5f6
And here:
https://gist.github.com/shahradj/dbc08f7d6cbb5264244d9dade66cdefc
Want to waste more time in your life? Check out our website remi.ai