Two-Sided Bandit Problems and Matching Markets

Sanmay Das

The Problem

We are studying the learning and decision problems of agents in matching models. Das and Kamenica have defined a class of problems in multi-agent learning and decision-making called two-sided bandit problems [1]. This class of problems is intended to capture the essence of two-sided matching scenarios in which agents must learn their preferences through experience, rather than knowing them a priori. Two-sided bandit models can be applied to a wide range of markets in which two different types of agents must match with each other. Consider a repeated game in which agents gain an uncertain payoff from being matched with a particular agent on the other side of the market in each time period. A natural example of such a situation is the dating market, in which men and women repeatedly go out on dates and learn about each other. Another example is a spot labor market, in which employers and employees are matched for particular job contracts. A matching mechanism is used to pair the agents. For example, we can consider a mechanism in which all the women decide which man to "ask out," and then each man selects a woman from his set of offers, with the rejected women left unmatched for that period. We are studying the properties of learning and decision-making algorithms in such settings.

Context

Standard models of matching in economics [2] almost always assume that each agent knows his or her preferences over the individuals on the other side of the market. This assumption is too restrictive for many markets, and the model introduced here is driven by the need to relax it. The existing literature on two-sided search with nontransferable utility (for example, [3]) assumes that matching is exogenous and random.
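The offer mechanism described above can be made concrete with a minimal sketch of a single matching round. This is an illustration, not the authors' implementation: the function name and the hard-coded value estimates are hypothetical, and the round is purely greedy, omitting the exploration a learning agent would add.

```python
def simultaneous_offer_round(woman_values, man_values):
    """One round of the offer mechanism: each woman makes an offer to the
    man she currently values most; each man accepts the best offer he
    receives. Rejected women and men with no offers sit out the period.

    woman_values[w][m] is woman w's current estimate of man m's value
    (and symmetrically for man_values); in the learning setting these
    estimates come from experience rather than being known a priori.
    """
    offers = {}  # man -> list of women who offered to him
    for w, values in woman_values.items():
        m = max(values, key=values.get)          # offer to highest-valued man
        offers.setdefault(m, []).append(w)
    matches = {}
    for m, suitors in offers.items():
        matches[m] = max(suitors, key=man_values[m].get)  # accept best offer
    return matches

# hypothetical 2x2 example: both women prefer m1, who prefers w2,
# so w1 and m2 are left unmatched this period
matches = simultaneous_offer_round(
    {"w1": {"m1": 0.9, "m2": 0.1}, "w2": {"m1": 0.8, "m2": 0.7}},
    {"m1": {"w1": 0.2, "w2": 0.6}, "m2": {"w1": 0.5, "w2": 0.4}},
)
```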
The problem we are studying is more deeply related to bandit problems [4], in which an agent must choose which arm of an n-armed bandit to pull in order to maximize long-term expected reward, taking into account the tradeoff between exploring (learning more about the reward distribution for each arm) and exploiting (pulling the arm with the maximum expected reward). Two-sided bandit problems extend the standard multi-armed bandit model by giving the arms of the bandit agency: they can decide whether to be pulled or not, or whom to be pulled by, and they themselves receive rewards based on who the puller is. This adds an enormous degree of complexity to the problem. It is prohibitively difficult to solve for game-theoretic equilibria in all but the simplest two-sided bandit problems. In fact, even defining what would constitute optimal behavior with unbounded computational resources and a perfect model of the world is hard to do. The focus of research should therefore be on good algorithms and what we can show about these algorithms in general settings. This approach is also related to the theory of learning in games [5], which considers more generally how individual learning rules affect outcomes in games and whether agents reach static equilibria. Various choices have to be made in modeling a two-sided matching game with learning. These choices relate to the structure of repetition in the game, the payoffs agents receive and when they receive them, how agent preferences over the other side of the market are defined, how much information agents have about themselves and others, what the space of actions available to agents is, and how agents are matched based on the actions they take.
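The explore/exploit tradeoff in the standard one-sided bandit can itself be sketched in a few lines. The following is a minimal illustration, not code from the paper; it assumes Gaussian arm rewards and uses an epsilon-greedy rule with incremental sample-mean estimates.

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, rounds=10000, seed=0):
    """Epsilon-greedy agent on a standard n-armed bandit: with probability
    epsilon it explores (pulls a random arm); otherwise it exploits the
    arm with the highest sample-mean reward estimate."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n         # number of pulls per arm
    estimates = [0.0] * n    # sample-mean reward estimates
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                           # explore
        else:
            arm = max(range(n), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)             # noisy payoff
        counts[arm] += 1
        # incremental update of the sample mean
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# three arms with true means 0.0, 0.5, 1.0; the agent should concentrate
# its pulls on the last arm
estimates, counts = epsilon_greedy_bandit([0.0, 0.5, 1.0])
```

Replacing the sample-mean update with `estimates[arm] += alpha * (reward - estimates[arm])` for a fixed step size `alpha` gives the leaky-integrator variant, which weights recent rewards more heavily.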
Progress

In previous work [1], we report empirical results that focus on the stability of outcomes in dating markets under different matching mechanisms when agents use a simple epsilon-greedy exploration policy and either sample means or a leaky integrator for estimating the values of agents on the other side of the market and the probabilities of offers being made or accepted. We consider three matching mechanisms in that paper. In Gale-Shapley matching, each agent submits a list of preferences and a centralized matching procedure produces a matching based on these lists. The Gale-Shapley algorithm [6] guarantees a matching that is stable under the submitted preferences. With simultaneous offers, each woman independently chooses one man to make an offer to, and each man selects one of the offers he receives. Women who are rejected are unmatched for the period, as are men who receive no offers. Finally, under the sequential offer mechanism, each woman again independently chooses one man to make an offer to. The offers, however, are randomly ordered, and men must decide on these "exploding" offers without knowing what other offers are coming. If an offer is rejected, the woman making the offer is unmatched in that period. A man is unmatched if he rejects all the offers he receives. Our results with 5 men and 5 women (see Figure 1) confirm our intuitions that under Gale-Shapley matching agents always converge to a stable matching (even for small epsilon) and that the sequential problem is "harder" than the simultaneous problem.
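The centralized Gale-Shapley procedure can be sketched as a deferred-acceptance loop: proposers work down their preference lists while reviewers hold on to the best offer seen so far. This is a minimal illustration, not the implementation used in the paper, and the example agents and preference lists are made up.

```python
def gale_shapley(proposer_prefs, reviewer_prefs):
    """Deferred acceptance. Returns a matching (reviewer -> proposer) that
    is stable with respect to the submitted preference lists.

    proposer_prefs[p] is p's ranked list of reviewers (best first);
    reviewer_prefs[r] is r's ranked list of proposers (best first)."""
    # rank[r][p] = position of p in r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in reviewer_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next reviewer to try
    engaged = {}                                  # reviewer -> held proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]     # p's best untried reviewer
        next_choice[p] += 1
        if r not in engaged:
            engaged[r] = p                        # r holds the first offer
        elif rank[r][p] < rank[r][engaged[r]]:
            free.append(engaged[r])               # r trades up; old proposer freed
            engaged[r] = p
        else:
            free.append(p)                        # r rejects p
    return engaged

# hypothetical 3x3 market with women proposing and men reviewing
match = gale_shapley(
    {"w1": ["m1", "m2", "m3"], "w2": ["m1", "m3", "m2"], "w3": ["m2", "m1", "m3"]},
    {"m1": ["w2", "w1", "w3"], "m2": ["w3", "w1", "w2"], "m3": ["w1", "w2", "w3"]},
)
```

Note that the guarantee is stability under the *submitted* lists; in the two-sided bandit setting the lists are built from learned value estimates, which is exactly why convergence to a stable matching under the true preferences is a nontrivial empirical question.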
Future Work

We are interested in modeling other markets in addition to the dating market using the two-sided bandit framework. Ideally, we would like to present a model that is analytically tractable under certain learning and decision-making algorithms, so that we can prove statements about the convergence and stability properties of those algorithms. Other possible extensions involve exploring learning algorithms that allow agents to perform well across a broad range of environments without having to make assumptions about the decision-making algorithms or learning processes of other agents. In addition to simpler models, we are also interested in more complex versions of the problem, for example, describing the behavior of simple learning rules with a greater diversity of preferences and a larger number of agents.

Acknowledgments

This report describes research done at the Center for Biological & Computational Learning, which is in the McGovern Institute for Brain Research at MIT, as well as in the Dept. of Brain & Cognitive Sciences, and which is affiliated with the Computer Sciences & Artificial Intelligence Laboratory (CSAIL). This research was sponsored by grants from: Office of Naval Research (DARPA) Contract No. MDA972-04-1-0037, Office of Naval Research (DARPA) Contract No. N00014-02-1-0915, National Science Foundation (ITR/SYS) Contract No. IIS-0112991, National Science Foundation (ITR) Contract No. IIS-0209289, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218693, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218506, and National Institutes of Health (Conte) Contract No. 1 P20 MH66239-01A1.
Additional support was provided by: Central Research Institute of Electric Power Industry (CRIEPI), DaimlerChrysler AG, Compaq/Digital Equipment Corporation, Eastman Kodak Company, Honda R&D Co., Ltd., Industrial Technology Research Institute (ITRI), Komatsu Ltd., Eugene McDermott Foundation, Merrill Lynch, NEC Fund, Oxygen, Siemens Corporate Research, Inc., Sony, Sumitomo Metal Industries, and Toyota Motor Corporation.

References

[1] Sanmay Das and Emir Kamenica. Two-sided bandits and the dating market. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, August 2005.
[2] Alvin E. Roth and Marilda Sotomayor. Two-Sided Matching: A Study in Game-Theoretic Modeling and Analysis. Econometric Society Monograph Series. Cambridge University Press, Cambridge, UK, 1990.
[3] K. Burdett and R. Wright. Two-sided search with nontransferable utility. Review of Economic Dynamics, 1:220-245, 1998.
[4] D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Monographs on Statistics and Applied Probability. Chapman and Hall, London, UK, 1985.
[5] D. Fudenberg and D. K. Levine. The Theory of Learning in Games. MIT Press, Cambridge, MA, 1998.
[6] D. Gale and L. S. Shapley. College admissions and the stability of marriage. The American Mathematical Monthly, 69(1):9-15, 1962.

