Paper TL;DR

The paper introduced Markov games, a model from game theory, to the AI/ML community as a way of thinking about multi-agent RL. It also provided a concrete algorithm for solving such problems, minimax Q-learning, along with empirical results showing advantages of this approach over standard Q-learning.

Overall Outlook

I’m glad I wrote the paper. At the time, I was enthralled by the idea of reinforcement learning and wanted to apply the ideas to a kind of video game called “sumo” that my mentor David Ackley had designed. Thinking about the game, it became clear to me that Q-learning, which was still young, wouldn’t be able to generate a good strategy for the game—some random behavior is needed to fake out the opponent. That insight led me to learn about game theory, and ultimately I saw that a natural extension of the Markov decision process (MDP) model—so popular for understanding RL environments—could capture this kind of problem quite well.

The basic idea of the model is simple to conceptualize if you are already familiar with MDPs. An MDP consists of a set of states, a set of actions, a transition function describing the impact of the selection of an action on the current state, and a reward function associated with states and actions that should be maximized. A Markov game adds a second set of actions under the control of a second agent. One agent is selecting actions to maximize reward and the other is selecting actions to minimize it. Any MDP is just a Markov game where the minimizer has only one choice in every state. Classic zero-sum matrix games are Markov games where there is only one state. Any standard board game where the players take turns can be viewed as a Markov game, but the model can also express games where the players must make their choices simultaneously.

The paper described these connections and also provided a Q-learning-like algorithm for learning optimal decisions in these games. I thought it was useful to broaden the standard MDP model to allow for more decision makers.
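To make that concrete, here is a minimal sketch of the minimax-Q idea in Python. It is my own illustration rather than the paper’s code, and it leaves out details like the learning-rate schedule and exploration. The key difference from ordinary Q-learning is that the Q-values are indexed by both players’ actions, and the value of a state is the value of the zero-sum matrix game formed by its Q-values, which can be computed with a small linear program (here via scipy).

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(Q_s):
    """Value and maximin mixed strategy of the zero-sum matrix game Q_s,
    where Q_s[a, o] is the maximizer's payoff for own action a and
    opponent action o. Solved as a small linear program."""
    n_a, n_o = Q_s.shape
    # Variables: [pi_0, ..., pi_{n_a-1}, v]; maximize v by minimizing -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For each opponent action o:  v - sum_a pi[a] * Q_s[a, o] <= 0.
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # The probabilities pi must sum to one (v is unconstrained).
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]   # game value, mixed strategy

def minimax_q_update(Q, V, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    """One minimax-Q backup: like the Q-learning update, except that the
    value of the next state is the value of its matrix game, not a max."""
    Q[s][a, o] += alpha * (r + gamma * V[s_next] - Q[s][a, o])
    V[s], pi_s = solve_matrix_game(Q[s])
    return pi_s   # the (possibly stochastic) policy to play in state s
```

The mixed strategy returned by the linear program is the stochastic policy the agent actually plays, which is exactly the kind of “faking out the opponent” behavior that plain Q-learning cannot represent.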

Looking back, I’m heartened that other people found this idea useful as well.

I learned a few things about the problem in the years that followed publication:

  • I couldn’t prove that minimax-Q converged, although it sure seemed like it should. Later, I connected with Csaba Szepesvári, who helped me articulate specific conditions for convergence, which minimax-Q satisfied (whew!). Those conditions were helpful for proving convergence of a bunch of other learning rules, so I’m glad we did that.
  • I later learned that the Markov games model was better known by a different name—“stochastic games”. I wish I had used that name. I also found out that convergence results for the model were already known (1953) from BEFORE MDPs were even invented (1957). Specifically, the value-iteration algorithm for computing optimal policies for MDPs was already proven to converge for the more general stochastic games model (see the sketch after this list). I was thinking that Markov games were a generalization of MDPs, but, really, MDPs are a restriction of Markov/stochastic games!
  • I was really interested in the general-sum version of the problem, but couldn’t figure out anything concrete in time for the paper. I kind of brushed it off, saying “Only the specific case of two-player zero-sum games is addressed…”, thinking I’d have something more to say by the next deadline. Turns out, the general-sum case is really really really subtle. I got to write several more papers about it and there’s still plenty more to say because it’s not really clear that the general-sum setting can be meaningfully solved. So, that’s been good and bad.
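For reference, here is what that older value-iteration scheme looks like when written in the same style as the sketch above. The array shapes R[s, a, o] and T[s, a, o, s'] are my own notational choices, and the code reuses the solve_matrix_game helper from the minimax-Q sketch.

```python
import numpy as np

def markov_game_value_iteration(R, T, gamma=0.9, iters=1000):
    """Value iteration for a zero-sum Markov (stochastic) game.
    R[s, a, o] is the maximizer's immediate reward and T[s, a, o, s'] the
    transition probabilities; solve_matrix_game is the LP helper from the
    minimax-Q sketch above."""
    n_s = R.shape[0]
    V = np.zeros(n_s)
    for _ in range(iters):
        V_new = np.empty(n_s)
        for s in range(n_s):
            # One-shot matrix game: immediate reward plus discounted future value.
            Q_s = R[s] + gamma * T[s] @ V
            V_new[s] = solve_matrix_game(Q_s)[0]
        V = V_new
    return V
```

If the opponent is restricted to a single action in every state, the matrix game collapses to a max over the agent’s own actions and this becomes ordinary MDP value iteration, which is the sense in which MDPs are a restriction of Markov/stochastic games.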

Opportunities for Improvement

I wish I had gotten more into general-sum games in the paper. There was a follow-up paper later that addressed general-sum games in a way that I think is unsatisfactory. That paper led to two other papers pointing out its shortcomings, so I won’t belabor the point here. But, I might have been able to save a lot of people a lot of trouble if I had clarified what was known about general-sum games in this paper.

You know, or not. Maybe this explicit back and forth was needed to explore the idea.

But, apart from the general-sum games issue, I think one major improvement over this paper was the work on “R-Max”, which showed how a polynomial bound on the number of suboptimal actions could be guaranteed when learning in Markov games. (Link: http://www.jmlr.org/papers/v3/brafman02a.html.) It’s still useful to have a model-free method like minimax-Q, but the guarantees for R-Max are so much nicer. In particular, the R-Max algorithm is model based—it uses its experience to build an approximate version of the Markov game it is in. For states that have not been visited sufficiently, it substitutes in an artificial large reward that is the maximum possible reward. That’s where the name “R-Max” comes from. This choice means that learners are drawn to states they haven’t visited many times, leading to improvements to their model. These days, people use R-Max as an approach for learning in MDPs. But, the paper presents it in the context of zero-sum games and shows that it results in near-optimal behavior in all but a polynomial number of steps.
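As a rough illustration of that mechanism, here is a schematic of the optimistic model construction. It is my own sketch rather than the paper’s pseudocode: it shows the single-agent (MDP) special case for brevity, uses a hypothetical visit threshold m for when a state-action pair counts as “known”, and models optimism as a self-loop paying the maximum reward.

```python
import numpy as np

class RMaxModel:
    """Optimistic model construction in the spirit of R-Max, sketched for
    the single-agent (MDP) case; the actual algorithm plays a zero-sum
    Markov game, with opponent actions indexing the tables as well."""
    def __init__(self, n_states, n_actions, r_max, m):
        self.r_max, self.m = r_max, m   # m: hypothetical visit threshold
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.reward_sum = np.zeros((n_states, n_actions))
        self.next_counts = np.zeros((n_states, n_actions, n_states))

    def observe(self, s, a, r, s_next):
        self.counts[s, a] += 1
        self.reward_sum[s, a] += r
        self.next_counts[s, a, s_next] += 1

    def optimistic_model(self):
        """Empirical model in which every under-visited (s, a) pair is
        replaced by a self-loop that pays the maximum possible reward."""
        n_states, n_actions = self.counts.shape
        R = np.full((n_states, n_actions), self.r_max)
        T = np.zeros((n_states, n_actions, n_states))
        for s in range(n_states):
            for a in range(n_actions):
                if self.counts[s, a] >= self.m:   # "known": use estimates
                    R[s, a] = self.reward_sum[s, a] / self.counts[s, a]
                    T[s, a] = self.next_counts[s, a] / self.counts[s, a]
                else:                              # "unknown": stay optimistic
                    T[s, a, s] = 1.0
        return R, T   # plan in this model, then act greedily
```

Planning in this optimistic model and acting greedily either yields near-optimal behavior or steers the agent toward under-visited state-action pairs, which improves the model; that implicit explore-or-exploit tradeoff is where the polynomial bound comes from.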

The paper did get some criticism. In the experiments presented in the paper, there was a separation between training and testing. That is, after having a chance to learn, the agent’s behavior was frozen, and then it had to face new learners. So, the goal was to learn a policy that was robust—hard to exploit. Some later work asked, why not just continue learning? In particular, a feature of minimax-Q is that it learns a stochastic policy, but it’s not clear a stochastic policy is needed if the agent is allowed to continue to adapt. If it starts choosing one action too often and that makes it exploitable, a continually adapting agent could then make the necessary adjustments.

I suppose that’s ok and I agree that the artificial separation between training and testing is awkward. But, I still think it’s very reasonable to want the learning algorithm to produce a high-quality non-learning artifact. That’s standard in supervised learning and single agent RL. It’s not obvious why multiagent RL should be considered fundamentally different. One could argue that both continual learning and train/test have their place.

It was also pointed out that the train/test setup I used has an important flaw in that the behavior of the opponent during training has a big influence on what the agent is able to learn. If the opponent is uncooperative (and why would an opponent in a zero-sum game be cooperative?), minimax-Q might only experience a limited fraction of the state space and therefore play extremely poorly in the testing phase. The R-Max work sidesteps this issue by NOT separating training and testing, instead measuring the number of mistakes made over the agent’s lifetime. At the same time, R-Max still learns the target stochastic policy, so it is arguably a better way of looking at things than either train/test OR continual re-learning.

I haven’t found it in print, but the dynamics of the soccer game also got some criticism. Uther proposed playing on a hexagonal grid, and I think he argued that this topology uses the field more effectively—for the rectangular grid I proposed, play ends up being restricted to just the middle of the field.

New Perspectives

Looking back, the paper was very much a product of its time. The reinforcement-learning community was still pretty young and Q-learning was almost the only game in town. People were interested in using RL to learn about games. In fact, Tesauro’s TD-Gammon work, which helped launch the first big wave of interest in reinforcement learning and temporal difference learning, was contemporary with this paper and also focused on learning in a zero-sum game. Providing information about how these two items—games and Q-learning—combine seemed like a clarifying step.

But, it could also be seen as somewhat naive. The experiments were small scale and no theoretical guarantees were provided. Those steps had to wait until later.