The multiplicative weights update method: a meta algorithm and applications

Algorithms in varied fields use the idea of maintaining a distribution over a certain set and use the multiplicative update rule to iteratively change these weights. Their analyses are usually very similar and rely on an exponential potential function. In this survey we present a simple meta-algorithm that unifies many of these disparate algorithms and derives them as simple instantiations of the meta-algorithm. We feel that since this meta-algorithm and its analysis are so simple, and its applications so broad, it should be a standard part of algorithms courses, like "divide and conquer."


Introduction
The Multiplicative Weights (MW) method is a simple idea which has been repeatedly discovered in fields as diverse as Machine Learning, Optimization, and Game Theory. The setting for this algorithm is the following. A decision maker has a choice of n decisions, and needs to repeatedly make a decision and obtain an associated payoff. The decision maker's goal, in the long run, is to achieve a total payoff which is comparable to the payoff of that fixed decision that maximizes the total payoff with the benefit of hindsight. While this best decision may not be known a priori, it is still possible to achieve this goal by maintaining weights on the decisions, and choosing the decisions randomly with probability proportional to the weights. In each successive round, the weights are updated by multiplying them with factors which depend on the payoff of the associated decision in that round. Intuitively, this scheme works because it tends to focus higher weight on higher payoff decisions in the long run.
This idea lies at the core of a variety of algorithms. Some examples include: the AdaBoost algorithm in machine learning [26]; algorithms for game playing studied in economics (see references later); the Plotkin-Shmoys-Tardos algorithm for packing and covering LPs [56], and its improvements in the case of flow problems by Young [65], Garg-Könemann [29, 30], Fleischer [24] and others; methods for convex optimization like exponentiated gradient (mirror descent), Lagrangian multipliers, and subgradient methods; Impagliazzo's proof of the Yao XOR lemma [40]; etc. The analysis of the running time uses a potential function argument, and the final running time is proportional to 1/ε².
It has been clear to several researchers that these results are very similar. For example, Khandekar's Ph.D. thesis [46] makes this point about the varied applications of this idea to convex optimization. The purpose of this survey is to clarify that many of these applications are instances of the same, more general algorithm (although several specialized applications, such as [53], require additional technical work). This meta-algorithm is very similar to the "Hedge" algorithm from learning theory [26]. Similar algorithms have been independently rediscovered in many other fields; see below. The advantage of deriving the above algorithms from the same meta-algorithm is that this highlights their commonalities as well as their differences. To give an example, the algorithms of Garg-Könemann [29, 30] were felt to be quite different from those of Plotkin-Shmoys-Tardos [56]. In our framework, they can be seen as a clever trick for "width reduction" for the Plotkin-Shmoys-Tardos algorithms (see Section 3.4).
We feel that this meta-algorithm and its analysis are simple and useful enough that they should be viewed as a basic tool taught to all algorithms students together with divide-and-conquer, dynamic programming, random sampling, and the like. Note that the multiplicative weights update rule may be seen as a "constructive" version of LP duality (equivalently, von Neumann's minimax theorem in game theory), and it gives a fairly concrete method for competing players to arrive at a solution/equilibrium (see Section 3.2). This may be an appealing feature in introductory algorithms courses, since the standard algorithms for LP such as simplex, ellipsoid, or interior point lack such a game-theoretic interpretation. Furthermore, it is a convenient stepping point to many other topics that rarely get mentioned in algorithms courses, including online algorithms (see the basic scenario in Section 1.1) and machine learning. Finally, our proofs seem easier and cleaner than the entropy-based proofs for the same results in machine learning (although the proof technique we use here has been used before; see for example Blum's survey [10]).
The current paper is chiefly a survey. It introduces the main algorithm, gives a few variants (mostly having to do with the range in which the payoffs lie), and surveys the most important applications, often with complete proofs. Note however that this survey does not cover all applications of the technique, as several of these require considerable additional technical work which is beyond the scope of this paper. We have provided pointers to some such applications which use the multiplicative weights technique at their core, without going into more detail. There are also a few small results that appear to be new, such as the variant of the Garg-Könemann algorithm in Section 3.4 and the lower bound in Section 4.
THEORY OF COMPUTING, Volume 8 (2012), pp. 121-164

Related work. An algorithm similar in flavor to the Multiplicative Weights algorithm was proposed in game theory in the early 1950's [13, 12, 59]. Following Brown [12], this algorithm was called "Fictitious Play": at each step each player observes actions taken by his opponent in previous stages, updates his beliefs about his opponents' strategies, and chooses the pure best response against these beliefs. In the simplest case, the player simply assumes that the opponent is playing from a stationary distribution and sets his current belief of the opponent's distribution to be the empirical frequency of the strategies played by the opponent. This simple idea (which was shown to lead to optimal solutions in the limit in various cases) led to numerous developments in economics, including Arrow-Debreu General Equilibrium theory and, more recently, evolutionary game theory. Grigoriadis and Khachiyan [33] showed how a randomized variant of "Fictitious Play" can solve two-player zero-sum games efficiently. This algorithm is precisely the multiplicative weights algorithm. It can be viewed as a soft version of fictitious play, in which the player gives higher weight to the strategies which pay off better, and chooses her strategy using these weights rather than choosing the best-response strategy.
In Machine Learning, the earliest form of the multiplicative weights update rule was used by Littlestone in his well-known Winnow algorithm [50, 51]. It is somewhat reminiscent of the older perceptron learning algorithm of Minsky and Papert [55]. The Winnow algorithm was generalized by Littlestone and Warmuth [52] in the form of the Weighted Majority algorithm, and later by Freund and Schapire in the form of the Hedge algorithm [26]. We note that most relevant papers in learning theory use an analysis that relies on entropy (or its cousin, Kullback-Leibler divergence) calculations. This analysis is closely related to ours, but we use exponential functions instead of the logarithm, or entropy, used in those papers. The underlying calculation is the same: whereas we repeatedly use the fact that e^x ≈ 1 + x when |x| is small, they use the fact that ln(1 + x) ≈ x. We feel that our approach is cleaner (although the entropy-based approach yields somewhat tighter bounds that are useful in some applications; see Section 2.2).
Other applications of the multiplicative weights algorithm in computational geometry include Clarkson's algorithm for linear programming with a bounded number of variables in linear time [20, 21]. Following Clarkson, Brönnimann and Goodrich use similar methods to find Set Covers for hypergraphs with small VC dimension [11].
The weighted majority algorithm, as well as more sophisticated versions, have been independently discovered in operations research and statistical decision making in the context of the On-line decision problem; see the surveys of Cover [22], Foster and Vohra [25], and also Blum [10], who includes applications of weighted majority to machine learning. A notable algorithm, which is different from but related to our framework, was developed by Hannan in the 1950's [34]. Kalai and Vempala showed how to derive efficient algorithms via methods similar to Hannan's [43]. We show how Hannan's algorithm with the appropriate choice of parameters yields the multiplicative update decision rule in Section 3.8.
Within computer science, several researchers have previously noted the close relationships between multiplicative update algorithms used in different contexts. Young [65] notes the connection between fast LP algorithms and Raghavan's method of pessimistic estimators for derandomization of randomized rounding algorithms; see our Section 3.5. Klivans and Servedio [49] relate boosting algorithms in learning theory to proofs of Yao's XOR Lemma; see our Section 3.6. Garg and Khandekar [28] describe a common framework for convex optimization problems that contains Garg-Könemann and Plotkin-Shmoys-Tardos as subcases.
To the best of our knowledge, our framework is the most general and, arguably, the simplest. We readily acknowledge the influence of all previous papers (especially Young [65] and Freund-Schapire [27]) on the development of our framework. We emphasize again that we do not claim that every algorithm designed using the multiplicative update idea fits in our framework, just that most do.
Paper organization. We proceed to define the illustrative weighted majority algorithm in this section. In Section 2 we describe the general MW meta-algorithm, followed by numerous and varied applications in Section 3. In Section 4 we give lower bounds, followed by the more general matrix MW algorithm in Section 5.

The weighted majority algorithm
Now we briefly illustrate the weighted majority algorithm in a simple and concrete setting, which will naturally lead to our generalized meta-algorithm. This is known as the Prediction from Expert Advice problem.
Imagine the process of picking good times to invest in a stock. For simplicity, assume that there is a single stock of interest, and its daily price movement is modeled as a sequence of binary events: up/down. (Below, this will be generalized to allow non-binary events.) Each morning we try to predict whether the price will go up or down that day; if our prediction happens to be wrong we lose a dollar that day, and if it's correct, we lose nothing.
The stock movements can be arbitrary and even adversarial. To balance out this pessimistic assumption, we assume that while making our predictions, we are allowed to watch the predictions of n "experts." These experts could be arbitrarily correlated, and they may or may not know what they are talking about. The algorithm's goal is to limit its cumulative losses (i.e., bad predictions) to roughly the same as the best of these experts. At first sight this seems an impossible goal, since it is not known until the end of the sequence who the best expert was, whereas the algorithm is required to make predictions all along.
Indeed, the first algorithm one thinks of is to compute each day's up/down prediction by going with the majority opinion among the experts that day. But this algorithm doesn't work, because a majority of experts may be consistently wrong on every single day.
The weighted majority algorithm corrects the trivial algorithm. It maintains a weighting of the experts. Initially all have equal weight. As time goes on, some experts are seen as making better predictions than others, and the algorithm increases their weight proportionately. The algorithm's prediction of up/down for each day is computed by going with the opinion of the weighted majority of the experts for that day.
Theorem 1.1. After T steps, let m_i^(T) be the number of mistakes of expert i and M^(T) be the number of mistakes our algorithm has made. Then we have the following bound for every i:

M^(T) ≤ 2(1 + η) m_i^(T) + (2 ln n)/η .

In particular, this holds for i which is the best expert, i.e., having the least m_i^(T).

Weighted majority algorithm
Initialization: Fix an η ≤ 1/2. With each expert i, associate the weight w_i^(1) := 1.
For t = 1, 2, . . ., T :
1. Make the prediction that is the weighted majority of the experts' predictions based on the weights w_1^(t), . . ., w_n^(t). That is, predict "up" or "down" depending on which prediction has a higher total weight of experts advising it (breaking ties arbitrarily).

2. For every expert i who predicts wrongly, decrease his weight for the next round by multiplying it by a factor of (1 − η):

w_i^(t+1) := (1 − η) w_i^(t)    (update rule).    (1.1)

When m_i^(T) is large compared to (2/η) ln n, we see that the number of mistakes made by the algorithm is bounded from above by roughly 2(1 + η) m_i^(T), i.e., approximately twice the number of mistakes made by the best expert. This is tight for any deterministic algorithm. However, the factor of 2 can be removed by substituting the above deterministic algorithm with a randomized algorithm that predicts "up" or "down" with probability proportional to the total weight of the experts advising that prediction. (In other words, if the total weight of the experts saying "up" is 3/4, then the algorithm predicts "up" with probability 3/4 and "down" with probability 1/4.) Then the number of mistakes after T steps is a random variable, and the claimed upper bound holds for its expectation (see Section 2 for more details).
Proof. A simple induction shows that w_i^(t+1) = (1 − η)^{m_i^(t)}. Let Φ^(t) = Σ_i w_i^(t) ("the potential function"). Thus Φ^(1) = n. Each time we make a mistake, the weighted majority of experts also made a mistake, so at least half the total weight decreases by a factor of 1 − η. Thus, the potential function decreases by a factor of at least (1 − η/2):

Φ^(t+1) ≤ Φ^(t) (1/2 + (1/2)(1 − η)) = Φ^(t) (1 − η/2) .

Thus simple induction gives Φ^(T+1) ≤ n (1 − η/2)^{M^(T)}. Since Φ^(T+1) ≥ w_i^(T+1) = (1 − η)^{m_i^(T)} for all i, the claimed bound follows by comparing the above two expressions and using the fact that −ln(1 − η) ≤ η + η² for η ≤ 1/2.

The beauty of this analysis is that it makes no assumption about the sequence of events: they could be arbitrarily correlated and could even depend upon our current weighting of the experts. In this sense, this algorithm delivers more than initially promised, and this lies at the root of why (after obvious generalization) it can give rise to the diverse algorithms mentioned earlier. In particular, the scenario where the events are chosen adversarially resembles a zero-sum game, which we consider later in Section 3.2.
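To make the procedure concrete, here is a minimal Python sketch of the deterministic weighted majority algorithm above (the function name and the encoding of up/down as +1/−1 are our own illustrative choices, not from the original presentation):

```python
def weighted_majority(expert_preds, outcomes, eta=0.5):
    """Deterministic weighted majority with +1/-1 predictions.

    expert_preds[t][i] is expert i's prediction in round t, and
    outcomes[t] is the true outcome.  Returns the algorithm's
    mistake count and the per-expert mistake counts.
    """
    n = len(expert_preds[0])
    w = [1.0] * n                       # w_i^(1) := 1 for every expert
    mistakes_alg = 0
    mistakes_exp = [0] * n
    for preds, outcome in zip(expert_preds, outcomes):
        # Predict with the weighted majority of the experts' advice.
        up = sum(wi for wi, p in zip(w, preds) if p == +1)
        down = sum(w) - up
        guess = +1 if up >= down else -1
        if guess != outcome:
            mistakes_alg += 1
        # Update rule (1.1): multiply each wrong expert's weight by (1 - eta).
        for i, p in enumerate(preds):
            if p != outcome:
                w[i] *= (1 - eta)
                mistakes_exp[i] += 1
    return mistakes_alg, mistakes_exp
```

On any input sequence, the returned mistake count satisfies the bound of Theorem 1.1 with respect to the best expert, no matter how the outcomes are chosen.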

The Multiplicative Weights algorithm
In the general setting, we have a set of n decisions and in each round, we are required to select one decision from the set. In each round, each decision incurs a certain cost, determined by nature or an adversary. All the costs are revealed after we choose our decision, and we incur the cost of the decision we chose. For example, in the prediction from expert advice problem, each decision corresponds to a choice of an expert, and the cost of an expert is 1 if the expert makes a mistake, and 0 otherwise.

To motivate the Multiplicative Weights (MW) algorithm, consider the naïve strategy that, in each iteration, simply picks a decision at random. The expected penalty will be that of the "average" decision. Suppose now that a few decisions are clearly better in the long run. This is easy to spot as the costs are revealed over time, and so it is sensible to reward them by increasing their probability of being picked in the next round (hence the multiplicative weight update rule).

Intuitively, being in complete ignorance about the decisions at the outset, we select them uniformly at random. This maximum entropy starting rule reflects our ignorance. As we learn which decisions are good and which are bad, we lower the entropy to reflect our increased knowledge. The multiplicative weight update is our means of skewing the distribution.
We now set up some notation. Let t = 1, 2, . . ., T denote the current round, and let i be a generic decision. In each round t, we select a distribution p^(t) over the set of decisions, and select a decision i randomly from it. At this point, the costs of all the decisions are revealed by nature in the form of the vector m^(t), such that decision i incurs cost m_i^(t). We assume that the costs lie in the range [−1, 1]. This is the only assumption we make on the costs; nature is completely free to choose the cost vector as long as these bounds are respected, even with full knowledge of the distribution from which we choose our decision.
The expected cost to the algorithm for sampling a decision i from the distribution p^(t) is

E_{i ∈ p^(t)} [m_i^(t)] = m^(t) · p^(t) .

The total expected cost over all rounds is therefore Σ_{t=1}^T m^(t) · p^(t). Just as before, our goal is to design an algorithm which achieves a total expected cost not too much more than the cost of the best decision in hindsight, viz. min_i Σ_{t=1}^T m_i^(t). Consider the following algorithm, which we call the Multiplicative Weights Algorithm. This algorithm has been studied before as the prod algorithm of Cesa-Bianchi, Mansour, and Stoltz [17], and Theorem 2.1 can be seen to follow from Lemma 2 in [17].
The following theorem, completely analogous to Theorem 1.1, bounds the total expected cost of the Multiplicative Weights algorithm (given in Figure 1) in terms of the total cost of the best decision:

Theorem 2.1. Assume that all costs m_i^(t) ∈ [−1, 1] and η ≤ 1/2. Then the Multiplicative Weights algorithm guarantees that after T rounds, for any decision i, we have

Σ_{t=1}^T m^(t) · p^(t) ≤ Σ_{t=1}^T m_i^(t) + η Σ_{t=1}^T |m_i^(t)| + (ln n)/η .

Multiplicative Weights algorithm
Initialization: Fix an η ≤ 1/2. With each decision i, associate the weight w_i^(1) := 1.
For t = 1, 2, . . ., T :
1. Choose decision i with probability proportional to its weight w_i^(t), i.e., use the distribution p^(t) = (w_1^(t)/Φ^(t), . . ., w_n^(t)/Φ^(t)), where Φ^(t) = Σ_i w_i^(t).
2. Observe the costs of the decisions m^(t).
3. Penalize the costly decisions by updating their weights as follows: for every decision i, set

w_i^(t+1) := w_i^(t) (1 − η m_i^(t)) .    (2.1)

Figure 1: The Multiplicative Weights algorithm.
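The meta-algorithm of Figure 1 takes only a few lines to implement. The following Python sketch is our own illustration (the cost vectors would in practice be supplied by nature or an adversary); it returns the total expected cost Σ_t m^(t) · p^(t):

```python
def multiplicative_weights(cost_vectors, eta=0.5):
    """MW algorithm of Figure 1: costs m_i^(t) in [-1, 1], eta <= 1/2.

    cost_vectors is a list of T cost vectors m^(t); returns the total
    expected cost sum_t m^(t) . p^(t).
    """
    n = len(cost_vectors[0])
    w = [1.0] * n                       # w_i^(1) := 1
    total_cost = 0.0
    for m in cost_vectors:
        phi = sum(w)                    # Phi^(t)
        p = [wi / phi for wi in w]      # p^(t) = w^(t) / Phi^(t)
        total_cost += sum(mi * pi for mi, pi in zip(m, p))
        # Penalize: w_i^(t+1) = w_i^(t) * (1 - eta * m_i^(t)).
        w = [wi * (1 - eta * mi) for wi, mi in zip(w, m)]
    return total_cost
```

For instance, with two decisions whose per-round costs are always (1, 0), the best decision has total cost 0, and the algorithm's total expected cost stays below the Theorem 2.1 bound (ln 2)/η no matter how many rounds are played.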
Proof. The proof is along the lines of the earlier one, using the potential function Φ^(t) = Σ_i w_i^(t). Since m_i^(t) ∈ [−1, 1] and η ≤ 1/2, we have

Φ^(t+1) = Σ_i w_i^(t+1) = Σ_i w_i^(t) (1 − η m_i^(t)) = Φ^(t) (1 − η m^(t) · p^(t)) ≤ Φ^(t) exp(−η m^(t) · p^(t)) .

Here, we used the fact that p_i^(t) = w_i^(t)/Φ^(t). Thus, by induction, after T rounds, we have

Φ^(T+1) ≤ Φ^(1) exp(−η Σ_{t=1}^T m^(t) · p^(t)) = n · exp(−η Σ_{t=1}^T m^(t) · p^(t)) .    (2.2)

Next we use the following facts, which follow immediately from the convexity of the exponential function:

(1 − η)^x ≤ (1 − ηx)   if x ∈ [0, 1],
(1 + η)^{−x} ≤ (1 − ηx)   if x ∈ [−1, 0].    (2.3)

Then, for every decision i, we have

Φ^(T+1) ≥ w_i^(T+1) = Π_{t=1}^T (1 − η m_i^(t)) ≥ (1 − η)^{Σ_{≥0} m_i^(t)} · (1 + η)^{−Σ_{<0} m_i^(t)} ,    (2.4)

where the subscripts "≥ 0" and "< 0" in the summations refer to the rounds t where m_i^(t) is ≥ 0 and < 0 respectively. Taking logarithms in equations (2.2) and (2.4) we get

ln n − η Σ_{t=1}^T m^(t) · p^(t) ≥ Σ_{≥0} m_i^(t) ln(1 − η) − Σ_{<0} m_i^(t) ln(1 + η) .

Negating, rearranging, and scaling by 1/η:

Σ_{t=1}^T m^(t) · p^(t) ≤ (ln n)/η − (1/η) Σ_{≥0} m_i^(t) ln(1 − η) + (1/η) Σ_{<0} m_i^(t) ln(1 + η)
≤ (ln n)/η + Σ_{t=1}^T m_i^(t) + η Σ_{t=1}^T |m_i^(t)| .

In the second inequality we used the facts that −ln(1 − η) ≤ η + η² and ln(1 + η) ≥ η − η², since η ≤ 1/2.

Corollary 2.2. The Multiplicative Weights algorithm also guarantees that after T rounds, for any distribution p on the decisions,

Σ_{t=1}^T m^(t) · p^(t) ≤ Σ_{t=1}^T m^(t) · p + η Σ_{t=1}^T |m^(t)| · p + (ln n)/η ,

where |m^(t)| is the vector obtained by taking the coordinate-wise absolute value of m^(t).
Proof. This corollary follows immediately from Theorem 2.1, by taking a convex combination of the inequalities for all decisions i with the distribution p.

Updating with exponential factors: the Hedge algorithm
In our description of the MW algorithm, the update rule uses multiplication by a linear function of the cost (specifically, (1 − η m_i^(t)) for decision i). In several other incarnations of the MW algorithm, notably the Hedge algorithm of Freund and Schapire [26], an exponential factor is used instead. The update rule is the following:

w_i^(t+1) := w_i^(t) · exp(−η m_i^(t)) .

As can be seen from the analysis of the MW algorithm, Hedge is not very different. The bound we obtain for Hedge is slightly different, however. While most of the applications we present in the rest of the paper can be derived using Hedge as well with a little extra calculation, some applications, such as the ones in Sections 3.3 and 3.5, explicitly need the MW algorithm rather than Hedge to obtain better bounds. Here, we state the bound obtained for Hedge without proof; the analysis is along the same lines as before. The only difference is that instead of the inequalities (2.3), we use the inequality exp(−ηx) ≤ 1 − ηx + η²x², valid for |ηx| ≤ 1.

Theorem 2.3. Assume that all costs m_i^(t) ∈ [−1, 1] and η ≤ 1. Then the Hedge algorithm guarantees that after T rounds, for any decision i, we have

Σ_{t=1}^T m^(t) · p^(t) ≤ Σ_{t=1}^T m_i^(t) + η Σ_{t=1}^T (m^(t))² · p^(t) + (ln n)/η .

Here, (m^(t))² is the vector obtained by taking the coordinate-wise square of m^(t).
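In code, the only line that changes relative to the MW algorithm is the weight update itself (a minimal sketch; the function name is our own):

```python
import math

def hedge_update(w, m, eta):
    """Hedge update: w_i^(t+1) = w_i^(t) * exp(-eta * m_i^(t)),
    replacing the MW rule w_i^(t+1) = w_i^(t) * (1 - eta * m_i^(t))."""
    return [wi * math.exp(-eta * mi) for wi, mi in zip(w, m)]
```

For small η m_i^(t) the two rules nearly coincide, since exp(−x) ≈ 1 − x when |x| is small.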
This guarantee is very similar to the one in Theorem 2.1, with one important difference: the term multiplying η is a loss which depends on the algorithm's distribution. In Theorem 2.1, this additional term depends on the loss of the best decision in hindsight. For some applications the latter guarantee is stronger (see Section 3.3).

Proof via KL-divergence
In this section, we give an alternative proof of Theorem 2.1 based on the Kullback-Leibler (KL) divergence, or relative entropy. While this proof is somewhat more complicated, it gives a good insight into why the MW algorithm works: the reason is that it tends to reduce the KL-divergence to the optimal solution. Another reason for giving this proof is that it yields a more nuanced form of the MW algorithm that is useful in some applications (such as the construction of hard-core sets; see Section 3.7). Readers may skip this section without loss of continuity.
For two distributions p and q on the decision set, the relative entropy between them is

RE(p ‖ q) := Σ_i p_i ln(p_i/q_i) ,

where the term p_i ln(p_i/q_i) is defined to be zero if p_i = 0, and infinite if p_i ≠ 0 and q_i = 0. Consider the following twist on the basic decision-making problem from Section 2. Fix a convex subset P of distributions over decisions (note: the basic setting is recovered when P is the set of all distributions). In each round t, the decision-maker is required to produce a distribution p^(t) ∈ P. At that point, the cost vector m^(t) is revealed and the decision-maker suffers cost m^(t) · p^(t). Since we make the restriction that p^(t) ∈ P, we now want to compare the total cost of the decision-maker to the cost of the best fixed distribution in P. Consider the algorithm in Figure 2.
Note that in the special case when P is the set of all distributions on the decisions, this algorithm is exactly the basic MW algorithm presented in Figure 1. The relative entropy projection step ensures that we always choose a distribution in P.

Multiplicative Weights Update algorithm with Restricted Distributions
Initialization: Fix an η ≤ 1/2. Set p^(1) to be an arbitrary distribution in P.
For t = 1, 2, . . ., T :
1. Choose decision i by sampling from p^(t).
2. Observe the costs of the decisions m^(t).
3. Compute the probability vector p̂^(t+1) using the usual multiplicative update rule: for every decision i,

p̂_i^(t+1) := p_i^(t) (1 − η m_i^(t)) / Φ^(t) ,

where Φ^(t) = Σ_i p_i^(t) (1 − η m_i^(t)) is the normalization factor that makes p̂^(t+1) a distribution.
4. Set p^(t+1) to be the relative entropy projection of p̂^(t+1) onto the set P, i.e.,

p^(t+1) := arg min_{p ∈ P} RE(p ‖ p̂^(t+1)) .

Figure 2: The Multiplicative Weights algorithm with Restricted Distributions.

This projection is a convex program, since relative entropy is convex and P is a convex set, and hence can be computed using standard convex programming techniques.
We now prove a bound on the total cost of the algorithm (compare to Corollary 2.2). Note that in the basic setting, when P is the set of all distributions, the bound given below is tighter than the one in Theorem 2.1.
Theorem 2.4. Assume that all costs m_i^(t) ∈ [−1, 1] and η ≤ 1/2. Then the Multiplicative Weights algorithm with Restricted Distributions guarantees that after T rounds, for any p ∈ P, we have

Σ_{t=1}^T m^(t) · p^(t) ≤ Σ_{t=1}^T m^(t) · p + η Σ_{t=1}^T |m^(t)| · p + RE(p ‖ p^(1))/η ,

where |m^(t)| is the vector obtained by taking the coordinate-wise absolute value of m^(t).
Proof. We use the relative entropy between p and p^(t), RE(p ‖ p^(t)) := Σ_i p_i ln(p_i/p_i^(t)), as a "potential" function. We have

RE(p ‖ p̂^(t+1)) − RE(p ‖ p^(t)) = Σ_i p_i ln(p_i^(t)/p̂_i^(t+1)) = ln Φ^(t) − Σ_i p_i ln(1 − η m_i^(t)) ≤ ln Φ^(t) + η m^(t) · p + η² |m^(t)| · p .

The inequality above follows from (2.3), together with the facts that −ln(1 − η) ≤ η + η² and ln(1 + η) ≥ η − η². Next, we have

ln Φ^(t) = ln(1 − η m^(t) · p^(t)) ≤ −η m^(t) · p^(t) ,

since ln(1 − x) ≤ −x for x < 1. Thus, we get

RE(p ‖ p̂^(t+1)) − RE(p ‖ p^(t)) ≤ −η m^(t) · p^(t) + η m^(t) · p + η² |m^(t)| · p .

This inequality essentially says that if the cost of the algorithm in round t, m^(t) · p^(t), is significantly larger than the cost of the comparator, m^(t) · p, then p̂^(t+1) moves closer to p (in the relative entropy distance) than p^(t). Now, projection onto the set P using relative entropy as a distance function is a Bregman projection, and thus it satisfies the following Generalized Pythagorean inequality (see, e.g., [39]), for any p ∈ P:

RE(p ‖ p̂^(t+1)) ≥ RE(p ‖ p^(t+1)) + RE(p^(t+1) ‖ p̂^(t+1)) .

I.e., the projection step only brings the distribution closer to p. Since relative entropy is always non-negative, we have RE(p^(t+1) ‖ p̂^(t+1)) ≥ 0 and so

RE(p ‖ p^(t+1)) − RE(p ‖ p^(t)) ≤ −η m^(t) · p^(t) + η m^(t) · p + η² |m^(t)| · p .

Summing up from t = 1 to T, dividing by η, and simplifying using the fact that RE(p ‖ p^(T+1)) is non-negative, we get the stated bound.

Gains instead of losses
There are situations where it makes more sense for the vector m^(t) to specify gains for each decision rather than losses. Now our goal is to get as much total expected payoff as possible in comparison to the total payoff of the best decision. We can get an algorithm for this case simply by running the Multiplicative Weights algorithm using the cost vector −m^(t).
The resulting algorithm is identical, and the following theorem follows directly from Theorem 2.1 by simply negating the quantities:

Theorem 2.5. Assume that all gains m_i^(t) ∈ [−1, 1] and η ≤ 1/2. Then the Multiplicative Weights algorithm (for gains) guarantees that after T rounds, for any decision i, we have

Σ_{t=1}^T m^(t) · p^(t) ≥ Σ_{t=1}^T m_i^(t) − η Σ_{t=1}^T |m_i^(t)| − (ln n)/η .

We also have the following immediate corollary, corresponding to Corollary 2.2:

Corollary 2.6. The Multiplicative Weights algorithm (for gains) also guarantees that after T rounds, for any distribution p on the decisions,

Σ_{t=1}^T m^(t) · p^(t) ≥ Σ_{t=1}^T m^(t) · p − η Σ_{t=1}^T |m^(t)| · p − (ln n)/η ,

where |m^(t)| is the vector obtained by taking the coordinate-wise absolute value of m^(t).

Applications
Typically, the Multiplicative Weights method is applied in the following manner. A prototypical example is to solve a constrained optimization problem. We let a decision represent each constraint in the problem, with costs specified by the points in the domain of interest. For a given point, the cost of a decision is made proportional to how well the corresponding constraint is satisfied at that point. This might seem counterintuitive, but recall that we reduce a decision's weight depending on its penalty, and if a constraint is well satisfied on the points seen so far, we would like its weight to be smaller, so that the algorithm focuses on the constraints that are poorly satisfied. In many applications (though not all) the choice of point is also under our control. Typically we will need to generate the maximally adversarial point, i.e., the point that maximizes the expected cost. Then the overall algorithm consists of two subprocedures: an "oracle" for generating the maximally adversarial point at each step, and the MW algorithm for updating the weights of the decisions. With this intuition, we can describe the following applications.

Learning a linear classifier: the Winnow algorithm
To the best of our knowledge, the first time multiplicative weight updates were used was in the Winnow algorithm of Littlestone [50]. This is an algorithmic technique used in machine learning to learn linear classifiers. Equivalently, it can also be seen as solving a linear program.
The setup is as follows. We are given m labeled examples (a_1, ℓ_1), (a_2, ℓ_2), . . ., (a_m, ℓ_m), where the a_j ∈ R^n are feature vectors and the ℓ_j ∈ {−1, 1} are their labels. Our goal is to find non-negative weights such that, for every example, the sign of the weighted combination of the features matches its label, i.e., find x ∈ R^n with x_i ≥ 0 such that for all j = 1, 2, . . ., m, we have sgn(a_j · x) = ℓ_j. Equivalently, we require that ℓ_j a_j · x ≥ 0 for all j. Without loss of generality, we may assume that the weights sum to 1 so that they form a distribution, i.e., 1 · x = 1, where 1 is the all-1's vector.
Thus, for notational convenience, if we redefine a_j to be ℓ_j a_j, then the problem reduces to finding a solution to the following LP:

∀ j = 1, 2, . . ., m :  a_j · x ≥ 0 ,
1 · x = 1 ,
∀ i = 1, 2, . . ., n :  x_i ≥ 0 .

Note that this is a quite general form of LP, and many commonly seen LPs can be reduced to this form. Now suppose there is a large-margin solution to this problem, i.e., there is an ε > 0 and a distribution x* such that for all j, we have a_j · x* ≥ ε. We now give an algorithm based on MW to solve the LP above. Define ρ = max_j ‖a_j‖_∞.
We run the MW algorithm in the gain form (see Section 2.3) with η = ε/(2ρ). The decisions are given by the n features, and gains are specified by the m examples. The gain of feature i when example j is chosen is (a_j)_i/ρ. Note that these gains lie in the range [−1, 1] as required.
In each round t, let x be the distribution p^(t) generated by the MW algorithm. Now, we look for a misclassified example, i.e., an example j such that a_j · x < 0. If no such example exists, we are done and we can stop. Otherwise, if j is a misclassified example, then it specifies the gains for round t. Note that the gain in round t is m^(t) · p^(t) = (1/ρ) a_j · x < 0, whereas for the solution x*, we have m^(t) · x* = (1/ρ) a_j · x* ≥ ε/ρ. We keep running the MW algorithm until we find a good solution (i.e., one that classifies all examples correctly).
To get a bound on the number of iterations until we find a good solution, we apply Corollary 2.6 with p = x*. Using the trivial bounds Σ_{t=1}^T m^(t) · p^(t) < 0 and |m^(t)| · x* ≤ 1, we get

0 > Σ_{t=1}^T m^(t) · p^(t) ≥ Σ_{t=1}^T m^(t) · x* − η Σ_{t=1}^T |m^(t)| · x* − (ln n)/η ≥ (ε/ρ) T − η T − (ln n)/η = (ε/(2ρ)) T − (2ρ ln n)/ε ,

which implies that T < 4ρ² ln(n)/ε². Thus, in at most 4ρ² ln(n)/ε² iterations, we find a good solution.
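Under the stated assumptions (labels already folded into the a_j, and a large-margin solution guaranteed to exist), the procedure can be sketched in Python as follows; the function name and termination policy are our own illustrative choices:

```python
import math

def mw_linear_classifier(A, eps):
    """Find a distribution x with a_j . x >= 0 for every row a_j of A,
    assuming some distribution x* satisfies a_j . x* >= eps for all j."""
    n = len(A[0])
    rho = max(max(abs(v) for v in a) for a in A)
    eta = eps / (2 * rho)
    max_iters = int(4 * rho * rho * math.log(n) / (eps * eps)) + 1
    w = [1.0] * n                       # weights over the n features
    for _ in range(max_iters):
        s = sum(w)
        x = [wi / s for wi in w]        # candidate distribution p^(t)
        # Look for a misclassified example: a_j . x < 0.
        bad = next((a for a in A
                    if sum(ai * xi for ai, xi in zip(a, x)) < 0), None)
        if bad is None:
            return x                    # all examples classified correctly
        # MW in the gain form: feature i receives gain (a_j)_i / rho.
        w = [wi * (1 + eta * ai / rho) for wi, ai in zip(w, bad)]
    return None                         # not reached if the margin assumption holds
```

By the bound above, the loop terminates with a feasible x within 4ρ² ln(n)/ε² iterations whenever the margin assumption holds.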

Solving zero-sum games approximately
We show how our general algorithm above can be used to approximately solve zero-sum games. (This is a duplication of the results of Freund and Schapire [27], who gave the same algorithm but a different proof of convergence that used KL-divergence. Furthermore, convergence of simple algorithms to zero-sum game equilibria was studied earlier in [34].) Let A be the payoff matrix of a finite 2-player zero-sum game, with n rows (the number of columns will play no role). When the row player plays strategy i and the column player plays strategy j, then the payoff to the column player is A(i, j) := A_ij. We assume that A(i, j) ∈ [0, 1]. If the row player chooses his strategy i from a distribution p over the rows, then the expected payoff to the column player for choosing a strategy j is A(p, j) := E_{i ∈ p}[A(i, j)]. Thus, the best response for the column player is the strategy j which maximizes this payoff. Similarly, if the column player chooses his strategy j from a distribution q over the columns, then the expected payoff he gets if the row player chooses the strategy i is A(i, q) := E_{j ∈ q}[A(i, j)]. Thus, the best response for the row player is the strategy i which minimizes this payoff. John von Neumann's min-max theorem says that if each of the players chooses a distribution over their strategies to optimize their worst case payoff (or payout), then the value they obtain is the same:

min_p max_j A(p, j) = max_q min_i A(i, q) ,    (3.1)

where p (resp., q) varies over all distributions over rows (resp., columns), and i (resp., j) varies over all rows (resp., columns). The common value of these two quantities, denoted λ*, is known as the value of the game. Let ε > 0 be an error parameter. We wish to approximately solve the zero-sum game up to additive error of ε, namely, find mixed row and column strategies p* and q* such that

max_j A(p*, j) ≤ λ* + ε ,    (3.2)
min_i A(i, q*) ≥ λ* − ε .    (3.3)

The algorithmic assumption about the game is that given any distribution p on rows, we have an efficient way to pick the best response, namely, the pure column strategy j that maximizes A(p, j). This quantity is at least λ* from the definition above. Call this algorithm the ORACLE.

Theorem 3.1. Given an error parameter ε > 0, there is an algorithm which solves the zero-sum game up to an additive factor of ε using O(log(n)/ε²) calls to ORACLE, with an additional processing time of O(n) per call.
Proof. We map our general algorithm from Section 2 to this setting by considering (3.3) as specifying n linear constraints on the probability vector q: viz., for all rows i, A(i, q) ≥ λ* − ε. Now, following the intuition given in the beginning of this section, we make our decisions correspond to pure strategies of the row player. Thus a distribution on the decisions corresponds to a mixed row strategy. Costs of the decisions are specified by pure strategies of the column player. The cost paid by decision i when the column player chooses strategy j is A(i, j).
In each round, given a distribution p^(t) on the rows, we will choose the column j^(t) to be the best response strategy to p^(t) for the column player, by calling ORACLE. Thus, the cost vector m^(t) is the j^(t)-th column of the matrix A.
Since all A(i, j) ∈ [0, 1], we can apply Corollary 2.2 to get that after T rounds, for any distribution p on the rows, we have

Σ_{t=1}^T A(p^(t), j^(t)) ≤ (1 + η) Σ_{t=1}^T A(p, j^(t)) + (ln n)/η .    (3.4)

Dividing by T, and using the fact that A(p, j^(t)) ≤ 1 and that for all t, A(p^(t), j^(t)) ≥ λ*, we get

λ* ≤ (1/T) Σ_{t=1}^T A(p^(t), j^(t)) ≤ (1/T) Σ_{t=1}^T A(p, j^(t)) + η + (ln n)/(ηT) .    (3.5)

Setting p to be the optimal row strategy, for which A(p, j) ≤ λ* for any j, and setting η = ε/2 and T = 4 ln(n)/ε², we get that (1/T) Σ_t A(p^(t), j^(t)) is an (additive) ε-approximation to λ*.

Let t̄ be the round t with the minimum value of A(p^(t), j^(t)). We have, from (3.5),

A(p^(t̄), j^(t̄)) ≤ (1/T) Σ_{t=1}^T A(p^(t), j^(t)) ≤ λ* + ε .

Since j^(t̄) maximizes A(p^(t̄), j) over all j, we conclude that p^(t̄) is an approximately optimal mixed row strategy, and thus we can set p* := p^(t̄).¹

We set q* to be the distribution which assigns to column j the probability |{t : j^(t) = j}|/T.

¹ Alternatively, we can set p* := (1/T) Σ_t p^(t). For let j̄ be the optimal column player response to p*. Then A(p*, j̄) = (1/T) Σ_t A(p^(t), j̄) ≤ (1/T) Σ_t A(p^(t), j^(t)) ≤ λ* + ε, since each j^(t) is the best response to p^(t).
From (3.5), for any row strategy i, by setting p to be concentrated on the pure strategy i, we have

A(i, q*) = (1/T) ∑_{t=1}^T A(i, j^(t)) ≥ (1/T) ∑_{t=1}^T A(p^(t), j^(t)) − η − ln(n)/(ηT) ≥ λ* − ε ,

which shows that q* is an approximately optimal mixed column strategy.
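The proof above translates directly into a short program. Below is a minimal Python sketch of Theorem 3.1 (the function name and the explicit loop structure are ours, not part of the paper's pseudocode): the row player runs Multiplicative Weights over rows while the best-response ORACLE plays for the column player.

```python
import math

def solve_zero_sum(A, eps):
    """Approximate a zero-sum game with cost matrix A (row player minimizes
    worst-case cost).  Returns (average row strategy, average game value).
    A toy sketch of Theorem 3.1 with eta = eps/2, T = ceil(4 ln(n)/eps^2)."""
    n, m = len(A), len(A[0])
    eta = eps / 2
    T = max(1, math.ceil(4 * math.log(n) / eps ** 2))
    w = [1.0] * n            # multiplicative weights on the rows
    avg_p = [0.0] * n
    value_sum = 0.0
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]
        # ORACLE: column player's best response maximizes the row player's cost
        j = max(range(m), key=lambda c: sum(p[i] * A[i][c] for i in range(n)))
        value_sum += sum(p[i] * A[i][j] for i in range(n))
        # multiplicative update: rows that paid a high cost lose weight
        for i in range(n):
            w[i] *= (1 - eta * A[i][j])
        for i in range(n):
            avg_p[i] += p[i] / T
    return avg_p, value_sum / T
```

On matching pennies (the 2×2 identity matrix, value λ* = 1/2), the returned average value lies in [λ*, λ* + ε] and the average row strategy is ε-optimal, exactly as the theorem and its footnote promise.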

Plotkin, Shmoys, Tardos framework for packing/covering LPs
Plotkin, Shmoys, and Tardos [56] generalized some known flow algorithms to a framework for approximately solving fractional packing and covering problems, which are a special case of linear programming formally defined below. Their algorithm is a quantitative version of the classical Lagrangean relaxation idea, and applies also to general linear programs. Below, we derive the algorithm for general LPs and then mention the slight modification that yields better running time for packing-covering LPs. Also, we note that we could derive this algorithm as a special case of game solving, but for concreteness we describe it explicitly.
The basic problem is to check the feasibility of the following convex program:

∃? x ∈ P : Ax ≥ b    (3.6)

where A ∈ R^{m×n} is an m × n matrix, x ∈ R^n, and P is a convex set in R^n. Intuitively, the set P represents the "easy" constraints to satisfy, such as non-negativity, and A represents the "hard" constraints. We wish to design an algorithm that, given an error parameter ε > 0, either solves the problem to an additive error of ε, i.e., finds an x ∈ P such that for all i, A_i x ≥ b_i − ε, or failing that, proves that the system is infeasible. Here, A_i is the ith row of A.
We assume the existence of an algorithm, called ORACLE, which, given a probability vector p on the m constraints, solves the following feasibility problem:

∃? x ∈ P : pᵀAx ≥ pᵀb    (3.7)

One way to implement this procedure is by maximizing pᵀAx over x ∈ P. It is reasonable to expect such an optimization procedure to exist (indeed, such is the case for many applications) since we only need to check the feasibility of one constraint rather than m. If the feasibility problem (3.6) has a solution x*, then the same solution also satisfies (3.7) for any probability vector p over the constraints. Thus, if there is a probability vector p over the constraints such that no x ∈ P satisfies (3.7), then this is proof that the original problem is infeasible. We assume that the ORACLE satisfies the following technical condition, which is necessary for deriving running time bounds.

Definition 3.2. An (ℓ, ρ)-bounded ORACLE, for parameters 0 ≤ ℓ ≤ ρ, is an algorithm which, given a probability vector p over the constraints, solves the feasibility problem (3.7). Furthermore, there is a fixed subset I ⊆ [m] of constraints such that whenever the ORACLE manages to find a point x ∈ P satisfying (3.7), the following holds:

∀i ∈ I :  A_i x − b_i ∈ [−ℓ, ρ] ,
∀i ∉ I :  A_i x − b_i ∈ [−ρ, ℓ] .

The value ρ is called the width of the problem.
In previous work, such as [56], only (ρ, ρ)-bounded ORACLEs are considered. We separate out the upper and lower bounds in order to obtain tighter guarantees on the running time. The results of [56] can be recovered simply by setting ℓ = ρ.

Theorem 3.3. Let ε > 0 be a given error parameter. Suppose there exists an (ℓ, ρ)-bounded ORACLE for the feasibility problem (3.7). Assume that ℓ ≥ ε/2. Then there is an algorithm which either finds an x such that for all i, A_i x ≥ b_i − ε, or correctly concludes that the system is infeasible. The algorithm makes only O(ℓρ log(m)/ε²) calls to the ORACLE, with an additional processing time of O(m) per call.
Proof. The condition ℓ ≥ ε/2 is only technical, and if it is not met we can just redefine ℓ to be ε/2. To map our general framework to this situation, we have a decision representing each of the m constraints. Costs are determined by points x ∈ P: the cost of constraint i for point x is

m_i = (1/ρ)[A_i x − b_i] .

In each round t, given a distribution p^(t) over the decisions (i.e., the constraints), we run the ORACLE with p^(t). If the ORACLE declares that there is no x ∈ P such that p^(t)ᵀAx ≥ p^(t)ᵀb, then we stop, because now p^(t) is proof that the problem (3.6) is infeasible.
So let us assume that this doesn't happen, i.e., in all rounds t, the ORACLE manages to find a solution x^(t) such that p^(t)ᵀAx^(t) ≥ p^(t)ᵀb. Since the cost vector to the Multiplicative Weights algorithm is specified to be m^(t) := (1/ρ)[Ax^(t) − b], we conclude that the expected cost in each round is non-negative:

m^(t) • p^(t) = (1/ρ)[p^(t)ᵀAx^(t) − p^(t)ᵀb] ≥ 0 .

Let i ∈ I. Then Theorem 2.1 tells us that after T rounds,

0 ≤ ∑_{t=1}^T m^(t) • p^(t) ≤ ∑_{t=1}^T (1/ρ)[A_i x^(t) − b_i] + η ∑_{t=1}^T (1/ρ)|A_i x^(t) − b_i| + ln(m)/η
  = (1 + η) ∑_{t=1}^T (1/ρ)[A_i x^(t) − b_i] − 2η ∑_{t: < 0} (1/ρ)[A_i x^(t) − b_i] + ln(m)/η
  ≤ (1 + η) ∑_{t=1}^T (1/ρ)[A_i x^(t) − b_i] + 2ηT ℓ/ρ + ln(m)/η .

Here, the subscript "< 0" refers to the rounds t when A_i x^(t) − b_i < 0. The last inequality follows because if A_i x^(t) − b_i < 0, then |A_i x^(t) − b_i| ≤ ℓ, since i ∈ I. Dividing by T, multiplying by ρ, and letting x̄ = (1/T) ∑_{t=1}^T x^(t) (note that x̄ ∈ P since P is a convex set), we get that

0 ≤ (1 + η)[A_i x̄ − b_i] + 2ηℓ + ρ ln(m)/(ηT) .

Now, if we choose η = ε/(4ℓ) (note that η ≤ 1/2 since ℓ ≥ ε/2), and T = 8ℓρ ln(m)/ε², we get that A_i x̄ ≥ b_i − ε. Reasoning similarly for i ∉ I, we get the same inequality. Putting both together, we conclude that x̄ satisfies the feasibility problem (3.6) up to an additive ε, as desired.
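For concreteness, here is a hedged Python sketch of the algorithm just analyzed, with P taken to be the probability simplex so that the ORACLE reduces to picking the best coordinate vertex; the function name and the simplification ℓ = ρ = 1 are ours, not part of the general framework.

```python
import math

def pst_feasibility(A, b, eps):
    """Sketch of the PST/MW feasibility test for  Ax >= b,  x in P, with P
    the probability simplex.  Assumes |A_i x - b_i| <= 1 on P (i.e., the
    ORACLE is (1, 1)-bounded).  Returns an eps-feasible point or None."""
    m, n = len(A), len(A[0])
    rho = 1.0
    eta = eps / 4
    T = max(1, math.ceil(8 * rho * math.log(m) / eps ** 2))
    w = [1.0] * m
    x_avg = [0.0] * n
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]
        # ORACLE: maximize p^T A x over the simplex -> a single coordinate vertex
        scores = [sum(p[i] * A[i][j] for i in range(m)) for j in range(n)]
        j = max(range(n), key=lambda c: scores[c])
        if scores[j] < sum(p[i] * b[i] for i in range(m)):
            return None          # p certifies that (3.6) is infeasible
        # cost of constraint i is (A_i x - b_i)/rho; violated constraints gain weight
        for i in range(m):
            w[i] *= (1 - eta * (A[i][j] - b[i]) / rho)
        x_avg[j] += 1.0 / T      # chosen vertex contributes to the average point
    return x_avg
```

With A the 2×2 identity and b = (0.4, 0.4), the feasible point x = (1/2, 1/2) exists and the returned average satisfies A_i x̄ ≥ b_i − ε; with b = (0.9, 0.9) the very first weight vector already certifies infeasibility.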

Concave constraints
The algorithm of Section 3.3 works not just for linear constraints over a convex domain, but also for concave constraints. Imagine that we have the following feasibility problem:

∃? x ∈ P : ∀i ∈ [m], f_i(x) ≥ 0    (3.8)

where, as before, P ⊆ R^n is a convex domain, and for i ∈ [m], f_i : P → R are concave functions. We wish to satisfy this system approximately, up to an additive error of ε. Again, we assume the existence of an ORACLE, which, when given a probability distribution p = (p_1, p_2, . . ., p_m), solves the following feasibility problem:

∃? x ∈ P : ∑_{i=1}^m p_i f_i(x) ≥ 0    (3.9)

An ORACLE is called (ℓ, ρ)-bounded if there is a fixed subset of constraints I ⊆ [m] such that whenever it returns a feasible solution x to (3.9), all constraints i ∈ I take values in the range [−ℓ, ρ] on the point x, and all the rest take values in [−ρ, ℓ].
Theorem 3.4. Let ε > 0 be a given error parameter. Suppose there exists an (ℓ, ρ)-bounded ORACLE for the feasibility problem (3.8). Assume that ℓ ≥ ε/2. Then there is an algorithm which either solves the problem up to an additive error of ε, or correctly concludes that the system is infeasible, making only O(ℓρ log(m)/ε²) calls to the ORACLE, with an additional processing time of O(m) per call.
Proof. Just as before, we have a decision for every constraint, and costs are specified by points x ∈ P. The cost of constraint i for point x is (1/ρ) f_i(x). Now we run the Multiplicative Weights algorithm with this setup. Again, if at any point the ORACLE declares that (3.9) is infeasible, we immediately halt and declare the system (3.8) infeasible. So assume this never happens. Then, as before, the expected cost in each round is m^(t) • p^(t) ≥ 0. Now, applying Theorem 2.1 as before, we conclude that for any i ∈ I, we have

0 ≤ (1 + η) ∑_{t=1}^T (1/ρ) f_i(x^(t)) + 2ηT ℓ/ρ + ln(m)/η .

Dividing by T, multiplying by ρ, and letting x̄ = (1/T) ∑_{t=1}^T x^(t) (note that x̄ ∈ P since P is a convex set), we get that

0 ≤ (1 + η) (1/T) ∑_{t=1}^T f_i(x^(t)) + 2ηℓ + ρ ln(m)/(ηT) ≤ (1 + η) f_i(x̄) + 2ηℓ + ρ ln(m)/(ηT) ,

by Jensen's inequality, since all the f_i are concave. Now, if we choose η = ε/(4ℓ) (note that η ≤ 1/2 since ℓ ≥ ε/2), and T = 8ℓρ ln(m)/ε², we get that f_i(x̄) ≥ −ε. Reasoning similarly for i ∉ I, we get the same inequality. Putting both together, we conclude that x̄ satisfies the feasibility problem (3.8) up to an additive ε, as desired.

Approximate ORACLEs
The algorithm described in the previous section allows some slack for the implementation of the ORACLE. This slack is very useful in designing efficient implementations for the ORACLE.
Define an ε-approximate ORACLE for the feasibility problem (3.6) to be one that solves the feasibility problem (3.7) up to an additive error of ε. That is, given a probability vector p on the constraints, either it finds an x ∈ P such that pᵀAx ≥ pᵀb − ε, or it declares correctly that (3.7) is infeasible.

Theorem 3.5. Let ε > 0 be a given error parameter. Suppose there exists an (ℓ, ρ)-bounded ε/3-approximate ORACLE for the feasibility problem (3.6). Assume that ℓ ≥ ε/3. Then there is an algorithm which either solves the problem up to an additive error of ε, or correctly concludes that the system is infeasible, making only O(ℓρ log(m)/ε²) calls to the ORACLE, with an additional processing time of O(m) per call.
Proof. We run the algorithm of the previous section with the given ORACLE, setting η = ε/(6ℓ). Now, in every round, the expected cost is at least −ε/(3ρ). Simplifying as before, we get that after T rounds, the average point x̄ = (1/T) ∑_{t=1}^T x^(t) returned by the ORACLE satisfies, for all i,

A_i x̄ ≥ b_i − 2ε/3 − ρ ln(m)/(ηT) .

Now, if T = 18ℓρ ln(m)/ε², then we get that for all i, A_i x̄ ≥ b_i − ε, as required.

Fractional Covering Problems
In fractional covering problems, the framework is the same as above, with the crucial difference that the coefficient matrix A is such that Ax ≥ 0 for all x ∈ P, and b > 0. An ε-approximate solution to this system is an x ∈ P such that Ax ≥ (1 − ε)b.

We assume without loss of generality (by appropriately scaling the inequalities) that b_i = 1 for all rows, so that now we desire to find an x ∈ P which satisfies the system within an additive ε. Since for all x ∈ P we have Ax ≥ 0, and since all b_i = 1, we conclude that for any i, A_i x − b_i ≥ −1. Thus, we assume that there is a (1, ρ)-bounded ORACLE for this problem. Applying Theorem 3.3, we get the following theorem.

Theorem 3.6. Suppose there exists a (1, ρ)-bounded ORACLE for the program Ax ≥ 1 with x ∈ P. Given an error parameter ε > 0, there is an algorithm which computes an ε-approximate solution to the program, or correctly concludes that it is infeasible, using O(ρ log(m)/ε²) calls to the ORACLE, plus an additional processing time of O(m) per call.

Fractional Packing Problems
A fractional packing problem can be written as

∃? x ∈ P : Ax ≤ b

where P is a convex domain such that Ax ≥ 0 for all x ∈ P, and b > 0. An ε-approximate solution to this system is an x ∈ P such that Ax ≤ (1 + ε)b.

Again, we assume that b_i = 1 for all i, scaling the constraints if necessary. Now, by rewriting this system as

∃? x ∈ P : −Ax ≥ −b

we cast it in our general framework, and a solution x ∈ P which satisfies this up to an additive ε is an ε-approximate solution to the original system. Since for all x ∈ P we have Ax ≥ 0, and since all b_i = 1, we conclude that for any i, −A_i x + b_i ≤ 1. Thus, we assume that there is a (1, ρ)-bounded ORACLE for this problem. Applying Theorem 3.3, we get the following:

Theorem 3.7. Suppose there exists a (1, ρ)-bounded ORACLE for the program −Ax ≥ −1 with x ∈ P. Given an error parameter ε > 0, there is an algorithm which computes an ε-approximate solution to the program, or correctly concludes that it is infeasible, using O(ρ log(m)/ε²) calls to the ORACLE, plus an additional processing time of O(m) per call.
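The two normalizations used above (scaling every row so that b_i = 1, then negating a packing system to cast it as −Ax ≥ −b) can be shown in a few lines. The helper name below is ours, purely for illustration.

```python
def packing_as_feasibility(A, b):
    """Rewrite the packing system  Ax <= b  in the general framework:
    scale each row i by 1/b_i so that b_i = 1, then negate to obtain
    -Ax >= -1.  Returns the transformed (matrix, right-hand side)."""
    m, n = len(A), len(A[0])
    A_scaled = [[A[i][j] / b[i] for j in range(n)] for i in range(m)]
    neg_A = [[-a for a in row] for row in A_scaled]
    neg_b = [-1.0] * m
    return neg_A, neg_b
```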

Approximating multicommodity flow problems
Multicommodity flow problems are represented by packing/covering LPs and thus can be approximately solved using the PST framework outlined above. The resulting flow algorithm is outlined below together with a brief analysis. Unfortunately, the algorithm is not polynomial-time because its running time is bounded by a polynomial function of the edge capacities (as opposed to the logarithm of the capacities, which is the number of bits needed to represent them). Garg and Könemann [29, 30] fixed this problem with a better algorithm whose running time does not depend upon the edge capacities.

Here we derive the Garg-Könemann algorithm using our general framework. This will highlight the essential new idea, namely, a reweighting of penalties to reduce the width parameter. Note that the algorithm is not quite the same as in [29, 30] (the termination condition is slightly different) and neither is the proof; the running time bound is the same, however.
For illustrative purposes we focus on the maximum multicommodity flow problem. In this problem, we are given a graph G = (V, E) with capacities c_e on edges, and a set of k source-sink pairs of nodes. Let P be the set of all paths between the source-sink pairs. The objective is to maximize the total flow between these pairs. The LP formulation is as follows:

max ∑_{p∈P} f_p
∀e ∈ E : ∑_{p: e∈p} f_p ≤ c_e
∀p ∈ P : f_p ≥ 0 .
Here, the variable f_p represents the flow on path p. Before presenting the Garg-Könemann idea we first present the algorithm one would obtain by applying our packing-covering framework (Section 3.3) in the obvious way.
First, note that by using binary search we can reduce the optimization problem to feasibility, by iteratively introducing a new constraint that gives a lower bound on the objective. So assume without loss of generality that we know the value F_opt of the total flow in the optimum solution. Then we want to solve the following feasibility problem:

∀e ∈ E : ∑_{p: e∈p} f_p ≤ c_e ,

where f ranges over the "easy" constraint set

{ f : ∑_{p∈P} f_p = F_opt and ∀p ∈ P, f_p ≥ 0 } .

In this form, the feasibility problem given above is a packing LP; thus, we can apply the Multiplicative Weights algorithm of Section 3.3.4.
As outlined in Section 3.3, the obvious algorithm would maintain at each step t a weight w_e^(t) for each edge e. The ORACLE can be implemented by finding the flow f which minimizes

∑_{e∈E} (w_e^(t)/c_e) ∑_{p: e∈p} f_p = ∑_{p∈P} f_p ∑_{e∈p} w_e^(t)/c_e .

The optimal flow is supported on a single path, namely, the path p^(t) ∈ P that has minimum length, when every edge e ∈ E is given length w_e^(t)/c_e. Thus in every round we find this path p^(t) and pass a flow F_opt on this path. Note that the final flow will be an average of the flows in each round, and hence will also have value F_opt. Costs for the edges are defined as in Section 3.3.
Unfortunately, the width parameter is max_{p,e} f_p/c_e = F_opt/c_min, where c_min is the capacity of the minimum capacity edge in the graph. The algorithm requires T = ρ ln(m)/ε² iterations to get a (1 − ε)-approximation to the optimal flow. The overall running time is Õ(F_opt · T_sp/c_min), where T_sp = Õ(mk) is the time needed to compute k shortest paths. As already mentioned, this is not polynomial-time since it depends upon 1/c_min rather than the logarithm of this value.

Now we describe the Garg-Könemann modification. We continue to maintain weights w_e^(t) for every edge e, where initially w_e^(1) = 1 for all e. The costs are determined by flows as before; however, we consider a larger set of flows, viz., flows supported on a single source-sink path, of arbitrary value. Note that we no longer need to use the value F_opt. Again, in each round we choose a flow f that is supported on the shortest path p^(t) ∈ P with edge lengths w_e/c_e.
The main idea behind width reduction is the following: instead of routing the same flow F_opt at each time step, we route only as much flow as is allowed by the minimum capacity edge on the path. In other words, at time t we route a flow of value c^(t) on path p^(t), where c^(t) is the minimum capacity of an edge on the path p^(t). The cost incurred by edge e is m_e^(t) = c^(t)/c_e. (In other words, a cost of 1/c_e per unit of flow passing through e.) The width is therefore automatically upper bounded by 1.
We run the MW algorithm with η = ε/2. The update rule in this setting consists of updating the weights of all edges in path p^(t) and leaving other weights unchanged at that step:

w_e^(t+1) = w_e^(t) (1 + η m_e^(t)) = w_e^(t) (1 + η c^(t)/c_e)   for all e ∈ p^(t) .

The termination rule for the algorithm is to stop as soon as for some edge e, the congestion f_e/c_e ≥ ln(m)/η², where f_e is the total amount of flow routed by the algorithm so far on edge e.

Analysis
We apply Theorem 2.5. Since we have m_e^(t) ∈ [0, 1] for all edges e and rounds t, we conclude that for any edge e, we have

∑_{t=1}^T m^(t) • p^(t) ≥ (1 − η) ∑_{t=1}^T m_e^(t) − ln(m)/η .    (3.11)

We now analyze both sides of this inequality. In round t, for any edge e, we have m_e^(t) = c^(t)/c_e if e ∈ p^(t), and 0 if e ∉ p^(t). Thus, we have

∑_{t=1}^T m_e^(t) = f_e/c_e ,    (3.12)

where f_e is the total amount of flow on e at the end of the algorithm, and

m^(t) • p^(t) = c^(t) ∑_{e∈p^(t)} (w_e^(t)/c_e) / ∑_{e∈E} w_e^(t) .    (3.13)

For the optimal flow f^opt, of value F_opt, we have

∑_{e∈E} w_e^(t) ≥ ∑_{e∈E} (w_e^(t)/c_e) ∑_{p: e∈p} f_p^opt = ∑_{p∈P} f_p^opt ∑_{e∈p} w_e^(t)/c_e ≥ F_opt ∑_{e∈p^(t)} w_e^(t)/c_e .

The first inequality follows because for any edge e, we have ∑_{p: e∈p} f_p^opt ≤ c_e. The second inequality follows from the fact that p^(t) is the shortest path with edge lengths given by w_e/c_e. Using this bound in (3.13), we get that

∑_{t=1}^T m^(t) • p^(t) ≤ ∑_{t=1}^T c^(t)/F_opt = F/F_opt ,    (3.14)

where F = ∑_{t=1}^T c^(t) is the total amount of flow passed by the algorithm. Plugging (3.12) and (3.14) into (3.11), we get that for every edge e,

F/F_opt ≥ (1 − η) f_e/c_e − ln(m)/η .

We stop the algorithm as soon as we have an edge e with congestion f_e/c_e ≥ ln(m)/η², so when the algorithm terminates we have

F/F_opt ≥ (1 − η) ln(m)/η² − ln(m)/η = (1 − 2η) ln(m)/η² .

Now, C := max_e {f_e/c_e} is the maximum congestion of the flow passed by the algorithm. So, the flow scaled down by C respects all capacities. Since each routing step increases the congestion of any single edge by at most 1, we have C ≤ ln(m)/η² + 1. For this scaled-down flow, we have that the total flow is

F/C ≥ (1 − 2η) (ln(m)/η²) F_opt / (ln(m)/η² + 1) ≥ (1 − O(ε)) F_opt ,

which shows that the scaled-down flow is within a (1 − O(ε)) factor of the optimal.

Running time. In every iteration t of the algorithm, consider the minimum capacity edge e on the chosen path p^(t). It gets congested by the flow of value c^(t) = c_e sent in that round, i.e., its congestion increases by 1. Since we stop the algorithm as soon as the congestion on any edge is at least ln(m)/η², any given edge can be the minimum capacity edge on the chosen path at most ln(m)/η² times in the entire run of the algorithm. Since there are m edges, the number of iterations is therefore at most m · ln(m)/η² = O(m log(m)/ε²). Each iteration involves k shortest path computations. Recall that T_sp is the time needed for this. Thus, the overall running time is O(T_sp · m log(m)/ε²).
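The width-reduction idea is compact enough to code directly. Below is a hedged Python sketch (names ours); for simplicity the shortest-path ORACLE is replaced by a minimum over an explicit list of paths, each given as a list of edge names.

```python
import math

def garg_konemann(paths, cap, eps):
    """Garg-Konemann-style max multicommodity flow, sketched over an explicit
    path list instead of shortest-path computations.  Routes the bottleneck
    capacity on the currently shortest path under lengths w_e / c_e, stops at
    congestion ln(m)/eta^2, and returns a capacity-respecting scaled flow."""
    eta = eps / 2
    w = {e: 1.0 for e in cap}          # edge weights, initially 1
    f = {e: 0.0 for e in cap}          # flow routed so far on each edge
    limit = math.log(len(cap)) / eta ** 2
    total = 0.0
    while all(f[e] / cap[e] < limit for e in cap):
        # ORACLE: path minimizing sum of w_e / c_e  (a shortest-path call)
        p = min(paths, key=lambda q: sum(w[e] / cap[e] for e in q))
        c = min(cap[e] for e in p)     # bottleneck capacity: width <= 1
        for e in p:
            f[e] += c
            w[e] *= (1 + eta * c / cap[e])   # congested edges grow heavier
        total += c
    C = max(f[e] / cap[e] for e in cap)      # max congestion; scale down
    return {e: f[e] / C for e in f}, total / C
```

On two disjoint unit-length paths with capacities 1 and 2 (maximum flow 3), the analysis above guarantees a flow of value at least (1 − 2η) F_opt that respects all capacities.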

O(log n)-approximation for many NP-hard problems
For many NP-hard problems, typically integer versions of packing-covering problems, one can compute an O(log n)-approximation by solving the obvious LP relaxation and then using Raghavan-Thompson [58] randomized rounding. This yields a randomized algorithm; to obtain a deterministic algorithm, derandomize it using Raghavan's [57] method of pessimistic estimators.
Young [65] has given an especially clear framework for understanding these algorithms which as a bonus also yields faster, combinatorial algorithms for approximating integer packing/covering programs. He observes that one can collapse the three ideas in the algorithm above (LP solving, randomized rounding, derandomization) into a single algorithm that uses the multiplicative update rule, and does not need to solve the LP relaxation directly. (Young's paper is titled "Randomized rounding without solving the linear program.") At the root of Young's algorithm is the observation that the analysis of randomized rounding uses the Chernoff-Hoeffding bounds. These bounds show that the sum of bounded independent random variables X_1, X_2, . . ., X_n (which should be thought of as the random variables generated in the randomized rounding algorithm) is sharply concentrated about its mean, and are proved by applying Markov's inequality to the variable e^{η(∑_i X_i)} for some parameter η. The key observation now is that the aforementioned application of Markov's inequality bounds the probability of failure (of the randomized rounding algorithm) away from 0 in terms of E[e^{η(∑_i X_i)}]. Thus, one can treat E[e^{η(∑_i X_i)}] as a pessimistic estimator (up to scaling by a constant) of the failure probability, and derandomization can be achieved by greedily (and hence, deterministically) choosing the X_i sequentially to decrease this pessimistic estimator. The resulting algorithm is essentially the MW algorithm: in each round t, the deterministic part of the pessimistic estimator, viz. e^{η ∑_{τ<t} X_τ}, plays the role of the weight.
Below, we illustrate this idea using the canonical problem in this class, SET COVER. (A similar analysis works for other problems.) Since we have developed the multiplicative weights framework already, we do not detail Young's original intuition involving Chernoff bound arguments and can proceed directly to the algorithm. In fact, the algorithm can be simplified so it becomes exactly the classical greedy algorithm, and we obtain a ln(n)-approximation, which is best possible for this problem up to constant factors (assuming reasonable complexity-theoretic conjectures [23]).
In the SET COVER problem, we are given a universe of n elements, say U = {1, 2, 3, . . ., n}, and a collection C of subsets of U whose union equals U. We are required to pick the minimum number of sets from C which cover all of U. Let this minimum number be denoted OPT. The Greedy Algorithm picks subsets iteratively, each time choosing the set which covers the maximum number of uncovered elements.

We analyze the Greedy Algorithm in our setup as follows. Each element of the universe represents a constraint: the union of the sets picked by the algorithm must cover it. Following the guidelines given in the beginning of Section 3, we cast the problem in our framework by letting decisions correspond to elements in the universe, with costs determined by sets C_j ∈ C. The cost of the constraint corresponding to element i for a given set C_j is 1 if i ∈ C_j and 0 otherwise.
To translate the greedy algorithm to our framework, suppose we run the Multiplicative Weights Update algorithm with this setup with η = 1. Since the analysis in the proof of Theorem 2.1 technically requires η ≤ 1/2, in the following we repeat the same potential function analysis for the current setting with η = 1. For η = 1, the update rule w_i^(t+1) = w_i^(t)(1 − η m_i^(t)) implies that elements that have been covered so far have weight 0 while all the rest have weight 1. Thus, the distribution p^(t) is simply the uniform distribution on the elements still uncovered at time t. Since the cost of an element for a given set is 1 if it is in the set and 0 otherwise, the set that maximizes the expected cost under p^(t) is the one that covers the maximum number of uncovered elements. The resulting algorithm is the Greedy Set Cover algorithm.
Since OPT sets cover all the elements, for any distribution p_1, p_2, . . ., p_n on the elements, one set must cover at least a 1/OPT fraction of the total probability. This implies that if we choose the set in round t that maximizes the expected cost, we have

m^(t) • p^(t) ≥ 1/OPT .

Following the analysis of Theorem 2.1, the change in potential for each round is:

Φ^(t+1) = ∑_i w_i^(t) (1 − m_i^(t)) = Φ^(t) (1 − m^(t) • p^(t)) < Φ^(t) e^{−m^(t) • p^(t)} ≤ Φ^(t) e^{−1/OPT} .

The strict inequality holds because m^(t) • p^(t) > 0 as long as there are uncovered elements. Thus, the potential drops by a factor of e^{−1/OPT} every round.
We run this as long as some element has not yet been covered. We show that T = OPT · ln(n) iterations suffice, which implies that we have a ln(n)-approximation to OPT. We have

Φ^(T+1) < Φ^(1) e^{−T/OPT} = n · e^{−ln(n)} = 1 .

Note that with η = 1, Φ^(T+1) is exactly the number of elements left uncovered after T iterations. Since this number is an integer smaller than 1, we conclude that all elements are covered.
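Since the MW instantiation with η = 1 collapses to the classical greedy algorithm, the resulting procedure is just a few lines of Python (names ours):

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover: this is exactly the MW algorithm above with eta = 1,
    where covered elements get weight 0, so the uniform distribution on
    uncovered elements makes the best-response set the one covering the
    most uncovered elements.  Returns the chosen sub-collection."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # best response: the set covering the most still-uncovered elements
        best = max(sets, key=lambda s: len(uncovered & set(s)))
        chosen.append(best)
        uncovered -= set(best)
    return chosen
```

By the potential argument above, the number of chosen sets is at most OPT · ln(n).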

Learning theory and boosting
Boosting [60], the process of combining several moderately accurate rules-of-thumb into a single highly accurate prediction rule, is a central idea in Machine Learning today. Freund and Schapire's AdaBoost [26] uses the Multiplicative Weights Update Rule and fits into our framework. Here we explain the main idea using some simplifying assumptions.
Let X be some set (domain) and suppose we are trying to learn an unknown function (concept) c : X → {0, 1} chosen from a concept class C. Given a sequence of training examples (x, c(x)), where x is generated from a fixed but unknown distribution D on the domain X, the learning algorithm is required to output a hypothesis h : X → {0, 1}. The error of the hypothesis is defined to be

Pr_{x∼D}[h(x) ≠ c(x)] .

A strong learning algorithm is one that, for every distribution D, given ε, δ > 0 and access to random examples drawn from D, outputs with probability at least 1 − δ a hypothesis whose error is at most ε. Furthermore, the running time is required to be polynomial in 1/ε, 1/δ and other relevant parameters. A γ-weak learning algorithm, for some given γ > 0, is an algorithm satisfying the same conditions except that the error can be as high as 1/2 − γ. Boosting shows that if a γ-weak learning algorithm exists for a concept class, then a strong learning algorithm exists. (The running time of the algorithm and the number of samples may depend on γ.) We prove this result in the so-called boosting by sampling framework, which uses a fixed training set S of N examples drawn from the distribution D. The goal is to make sure that the final hypothesis erroneously classifies at most an ε fraction of this training set. Using VC-dimension theory (see [26]) one can then show that if the weak learner produces hypotheses from a class H of bounded VC-dimension, and if N is chosen large enough (in terms of the error and confidence parameters, and the VC-dimension of the hypothesis class H), then with probability at least 1 − δ over the choice of the sample set, the error of the hypothesis over the entire domain X (under distribution D) is at most 2ε.
The idea in boosting is to repeatedly run the weak learning algorithm on different distributions defined on the fixed training set S. The final hypothesis has error ε under the uniform distribution on S. We run the MW algorithm with η = γ for T = (2/γ²) ln(1/ε) rounds. The decisions correspond to samples in S, and costs are specified by a hypothesis generated by the weak learning algorithm, in the following way. If hypothesis h is generated, the cost for decision point x is 1 or 0 depending on whether h(x) = c(x) or not. In other words, the cost vector m (indexed by x rather than i) is specified by m_x = 1 − |h(x) − c(x)|. Intuitively, we want the weight of an example to increase if the hypothesis labels it incorrectly.
In each iteration, the algorithm presents the current distribution p^(t) on the examples to the weak learning algorithm, and in return obtains a hypothesis h^(t) whose error with respect to the distribution p^(t) is not more than 1/2 − γ; in other words, the expected cost in each iteration satisfies

m^(t) • p^(t) ≥ 1/2 + γ .

The algorithm is run for T = (2/γ²) ln(1/ε) rounds. The final hypothesis, h_final, labels x ∈ X according to the majority vote among h^(1)(x), h^(2)(x), . . ., h^(T)(x).
Let E be the set of x ∈ S incorrectly labeled by h_final. The total cost of each x ∈ E is at most T/2, since the majority vote gives an incorrect label for it. Instead of applying Theorem 2.1, we apply Theorem 2.4, which gives a more nuanced bound. The set P in this theorem is simply the set of all possible distributions on S. Choosing p to be the uniform distribution on E, we get

(1/2 + γ) T ≤ ∑_{t=1}^T m^(t) • p^(t) ≤ (1 + η) T/2 + ln(N/|E|)/η .

Since η = γ and T = (2/γ²) ln(1/ε), the above inequality implies that the fraction of errors, |E|/N, is at most ε, as desired.
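The boosting loop just described can be sketched in Python. In the toy version below (names ours), the "weak learner" simply scans a fixed pool of candidate hypotheses for the one with the smallest weighted error under the current distribution; this is an illustration of the framework, not AdaBoost's exact reweighting.

```python
import math

def boost(samples, labels, hypotheses, gamma, eps):
    """Boosting-by-sampling sketch: run MW on the training examples with
    eta = gamma for T = (2/gamma^2) ln(1/eps) rounds; each round the weak
    learner picks the pool hypothesis with minimum weighted error, and
    correctly-labeled examples (cost 1) lose weight.  Returns the
    majority-vote hypothesis."""
    N = len(samples)
    eta = gamma
    T = max(1, math.ceil((2 / gamma ** 2) * math.log(1 / eps)))
    w = [1.0] * N
    votes = []
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]
        # weak learner: minimize weighted error over the hypothesis pool
        h = min(hypotheses,
                key=lambda g: sum(p[i] for i in range(N)
                                  if g(samples[i]) != labels[i]))
        votes.append(h)
        for i in range(N):
            if h(samples[i]) == labels[i]:
                w[i] *= (1 - eta)       # correct examples shrink in weight
    def h_final(x):
        return 1 if 2 * sum(h(x) for h in votes) > len(votes) else 0
    return h_final
```

When the pool contains a hypothesis of weighted error at most 1/2 − γ for every distribution (here, a perfect threshold stump), the majority vote errs on at most an ε fraction of the training set.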

Hard-core sets and the XOR Lemma
A boolean function f : X → {0, 1}, where X is a finite domain, is γ-strongly hard for circuits of size S if for every circuit C of size at most S,

Pr_x[C(x) = f(x)] ≤ 1/2 + γ .

Here x ∈ X is drawn uniformly at random, and γ < 1/2 is a parameter. For some parameter ε > 0, it is ε-weakly hard for circuits of size S if for every circuit C of size at most S, we have

Pr_x[C(x) = f(x)] ≤ 1 − ε .

Now given f : {0, 1}^n → {0, 1}, define f^⊕k : {0, 1}^{nk} → {0, 1} to be the boolean function obtained by dividing up the input nk-bit string into k blocks of n bits each in the natural way, applying f to each block in turn, and taking the XOR of the k outputs. Yao's XOR Lemma [64] shows that if f is ε-weakly hard against circuits of size S, then f^⊕k is (γ + (1 − ε)^k)-strongly hard for circuits of size Sε²γ²/8.
The original proofs were difficult, but Impagliazzo [40] suggested a simpler proof that as a byproduct proves an interesting fact about weakly hard functions: there is a reasonably large subset (at least an ε fraction of X) of inputs on which the function behaves like a strongly hard function, for somewhat smaller circuit size S′. This subset is called a hard-core set. For technical reasons, to prove such a result, it suffices to exhibit a "smooth" distribution p on X (precise definition given momentarily), such that for any circuit C of size at most S′, we have

Pr_{x∼p}[C(x) = f(x)] ≤ 1/2 + γ .

An ε-smooth distribution p (the same ε from the weak hardness assumption on f) is one that doesn't assign too much probability to any single input: p_x ≤ 1/(ε|X|) for any x ∈ X. Such a distribution can be decomposed as a convex combination of probability distributions over subsets of size at least ε|X|. Klivans and Servedio [49] observed that Impagliazzo's proof is an application of a boosting algorithm. The argument is as follows. We are given a boolean function f that is ε-weakly hard for circuits of size S. Assume for the sake of contradiction that f is not γ-strongly hard on any smooth distribution for circuits of some size S′ < S. Then for any smooth distribution, we can find a small circuit of size S′ that calculates f correctly with probability better than 1/2 + γ when inputs are drawn from the distribution. Treat this as a weak learning algorithm, and apply boosting. Boosting combines the small circuits of size S′ found by the weak learning algorithm into a larger circuit that calculates f correctly with probability at least 1 − ε on the uniform distribution on X, contradicting the fact that f is ε-weakly hard, provided we ensure that the size of the larger circuit is smaller than S. This can be done if S′ is set to O(S/T), where T is the number of boosting rounds.
With this insight, the problem now boils down to designing boosting algorithms that (a) are able to deal with smooth distributions on inputs and (b) have a small number of boosting rounds. The lower the number of boosting rounds, the better the circuit size bound we get for showing γ-strong hardness.

The third author [45] has shown how to construct such a boosting algorithm using the MW algorithm for restricted distributions (see Section 2.2). This boosting algorithm obtains the best known parameters in hard-core set constructions directly, without having to resort to composing two different boosting algorithms as in [49]. This technique was extended in [8] to obtain uniform constructions of hard-core sets with the best known parameters.

We describe the boosting algorithm of [45] now. The main observation is that the set of all ε-smooth distributions is convex. Call this set P. Then, exactly as in Section 3.6, the boosting algorithm simply runs the MW algorithm, with the only difference being that the distributions it generates are restricted to be in P using relative entropy projections, as in the algorithm of Section 2.2. We can now apply the same analysis as in Section 3.6. Following this analysis, let E ⊆ X be the set of inputs on which the final majority hypothesis incorrectly computes f. Now we claim that |E| < ε|X|: otherwise, since the uniform distribution on E is ε-smooth, we obtain a contradiction for T = (2/γ²) ln(1/ε) as before. Thus, the final majority hypothesis computes f correctly on at least a 1 − ε fraction of inputs from X.
This immediately implies the following hard-core set existence result, which has the best known parameters to date.

Theorem 3.8. Given a function f : X → {0, 1} that is ε-weakly hard for circuits of size S, there is a subset of X of size at least ε|X| on which f is γ-strongly hard for circuits of size O(Sγ²/ln(1/ε)).

Hannan's algorithm and multiplicative weights
Perhaps the earliest decision-making algorithm which attains bounds similar to the MW algorithm is Hannan's algorithm [34], dubbed "follow the perturbed leader" by Kalai and Vempala [43]. This algorithm is given below.
For t = 1, 2, . . ., T :

1. Choose the decision which minimizes the total cost, including a random initial cost:

i^(t) = argmin_i { r_i + L_i^(t−1) } ,

where L_i^(t−1) = ∑_{τ<t} m_i^(τ) is the total cost so far for the ith decision and r_i is a random initial cost.
2. Observe the costs of the decisions, m^(t).

In this section we show that for a particular choice of random initialization numbers {r_i}, the algorithm above exactly reproduces the multiplicative weights algorithm, or more precisely the Hedge algorithm as in Section 2.1. This observation is due to Adam Kalai [42].

Theorem 3.9 ([42]). Let u_1, . . ., u_n be n independent random numbers chosen uniformly from [0, 1], and consider the algorithm above with r_i = (1/η) ln ln(1/u_i). Then for any decision j, the probability that j is chosen is

e^{−ηL_j^(t)} / ∑_i e^{−ηL_i^(t)} .
Proof. By monotonicity of the exponential function, choosing the decision minimizing L_i^(t) + (1/η) ln ln(1/u_i) is the same as choosing the decision minimizing

s_i / w_i^(t) ,

where w_i^(t) = e^{−ηL_i^(t)} and s_i = ln(1/u_i) are independent exponentially distributed random variables with mean 1. The result now follows from Lemma 3.10 below. Integrating to remove the conditioning on s_j, we have

Pr[∀i ≠ j : s_i/w_i > s_j/w_j] = ∫_0^∞ e^{−s_j} ∏_{i≠j} e^{−(w_i/w_j) s_j} ds_j = w_j / ∑_i w_i .
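The equivalence of Theorem 3.9 can be checked empirically. The sketch below (names ours) draws one FPL decision with the perturbation r_i = (1/η) ln ln(1/u_i) and compares the empirical choice frequencies against the Hedge probabilities; by the theorem they should agree up to sampling noise.

```python
import math
import random

def fpl_choice(L, eta, rng):
    """One follow-the-perturbed-leader draw with r_i = (1/eta) ln ln(1/u_i):
    perturb each total cost L_i and return the index of the minimum."""
    pert = []
    for Li in L:
        u = max(rng.random(), 1e-300)   # guard against u == 0
        s = -math.log(u)                # s = ln(1/u) ~ Exp(1)
        pert.append(Li + (1.0 / eta) * math.log(s))
    return min(range(len(L)), key=lambda i: pert[i])

def hedge_probs(L, eta):
    """Hedge distribution: weights e^{-eta L_i}, normalized."""
    w = [math.exp(-eta * Li) for Li in L]
    total = sum(w)
    return [wi / total for wi in w]
```

A Monte Carlo run over, say, 200,000 draws matches the Hedge distribution to within about one percent.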

Online convex optimization
Online convex optimization is a very general framework that can be applied to many of the problems discussed in the applications section and many more.Here "online" means that the algorithm does not know the entire input at the start, and the input is presented to it in pieces over many rounds.In this section we describe the framework and the central role of the multiplicative weights method.For a much more detailed treatment of online learning techniques see [16].
In online convex optimization, we move from a discrete decision set to a continuous one.Specifically, the set of decisions is a convex, compact set K ⊆ R n .In each round t = 1, 2, . .., the online algorithm is required to choose a decision, i. e., point p (t) ∈ K.A convex loss function f (t) is presented, and the decision maker incurs a loss of f (t) (p (t) ).The goal of the online algorithm A is to minimize loss compared to the best fixed offline strategy.This quantity is called regret in the game theory and machine learning literature.
The basic decision-making problem described in Section 2 with n discrete decisions is recovered as a special case of online convex optimization as follows. The convex set K is the n-dimensional simplex corresponding to the set of all distributions over the n decisions, and the loss functions f^(t) are linear: f^(t)(p) = m^(t) · p, given the cost vector m^(t). Online convex optimization also generalizes other online learning problems such as the online portfolio selection problem and online routing (see [35] for more discussion of applications). Zinkevich [66] gives algorithms for the general online convex optimization problem. More efficient algorithms for online convex optimization based on strong convexity of the loss functions have also been developed [37].
We now describe how to apply the Multiplicative Weights algorithm to the online convex optimization problem for the special case where K is the n-dimensional simplex of distributions on the coordinates. The advantage is that this algorithm has a much better dependence on the dimension n than Zinkevich's algorithm (see [35] for more details). This algorithm has several applications, such as online portfolio selection [38]. Here is the algorithm. First, define

ρ = max_t max_{p∈K} ‖∇f^(t)(p)‖_∞,

where ∇f^(t)(p) is the (sub)gradient of the function f^(t) at the point p. The parameter ρ is called the width of the problem. Then run the standard Multiplicative Weights algorithm with η = √(ln(n)/T) and the costs defined as

m_i^(t) = (1/ρ) ∇f^(t)(p^(t))_i,

where p^(t) is the point played in round t. Note that for all t and all i, |m_i^(t)| ≤ 1, as required by the MW algorithm.
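As a concrete sketch of this reduction (our own illustration: the loss function f^(t)(p) = ½‖p − q‖², the point q, and the horizon T are assumed for the example, not taken from the survey; the gradient is p − q, so the width ρ is at most 1 on the simplex):

```python
import numpy as np

def mw_simplex(grad, n, T, rho):
    """Multiplicative Weights on the n-simplex with scaled costs m^(t) = grad/rho."""
    eta = np.sqrt(np.log(n) / T)
    w = np.ones(n)
    points = []
    for t in range(T):
        p = w / w.sum()                  # current point in the simplex
        points.append(p)
        m = grad(t, p) / rho             # scaled costs, |m_i| <= 1
        w = w * (1.0 - eta * m)          # multiplicative update
    return np.array(points)

# Example: fixed convex loss f(p) = 0.5 * ||p - q||^2 with gradient p - q, width 1.
q = np.array([0.2, 0.3, 0.5])
pts = mw_simplex(lambda t, p: p - q, n=3, T=5000, rho=1.0)
p_bar = pts.mean(axis=0)                 # average iterate
```

By Theorem 3.11 below and convexity, the average iterate p_bar nearly minimizes the average loss, i.e., it ends up close to q.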
Theorem 3.11. After T rounds of applying the Multiplicative Weights algorithm to the online convex optimization framework, for any p ∈ K we have

∑_{t=1}^T f^(t)(p^(t)) ≤ ∑_{t=1}^T f^(t)(p) + O(ρ √(T ln n)).

Proof. If f : K → R is a differentiable convex function, then for any two points p, q ∈ K we have f(q) ≥ f(p) + ∇f(p) · (q − p), where ∇f(p) is the gradient of f at p. Rearranging, we get

f(p) − f(q) ≤ ∇f(p) · (p − q).    (3.17)

Applying Corollary 2.2, we get that for any p ∈ K,

∑_{t=1}^T m^(t) · p^(t) ≤ ∑_{t=1}^T m^(t) · p + ηT + ln(n)/η,

and since (from (3.17))

f^(t)(p^(t)) − f^(t)(p) ≤ ∇f^(t)(p^(t)) · (p^(t) − p) = ρ (m^(t) · p^(t) − m^(t) · p),

we conclude that ∑_{t=1}^T f^(t)(p^(t)) − ∑_{t=1}^T f^(t)(p) ≤ ρ(ηT + ln(n)/η). Substituting η = √(ln(n)/T) completes the proof.

Solving semidefinite programs

Semidefinite programming (SDP) is a special case of convex programming: a semidefinite program is derived from a linear program by imposing the additional constraint that some subset of n² variables form an n × n positive semidefinite matrix. Since the work of Goemans and Williamson [31], SDP has become an important tool in the design of approximation algorithms for NP-hard optimization problems. Though this yields polynomial-time algorithms, their practicality is suspect because solving SDPs is a slow process. Therefore there is great interest in computing approximate solutions to SDPs, especially of the types that arise in approximation algorithms. Since the SDPs in question are being used to design approximation algorithms anyway, it is permissible to compute approximate solutions to them. Klein and Lu [47] use the PST framework (Section 3.3) to derive a more efficient 0.878-approximation algorithm for MAX-CUT than the original SDP-based method in Goemans-Williamson [31]. The main idea in Klein-Lu is to solve the MAX-CUT SDP only approximately. However, their idea does not work very well for other SDPs: the main issue is that the width ρ (see Section 3.3) is too high for certain SDPs of interest.
To be more precise, an SDP feasibility problem is given by:

A_j • X ≥ b_j for j = 1, 2, . . ., m,
X ∈ P.

Here, we use the notation A • B = ∑_{ij} A_{ij} B_{ij} to denote the scalar product of two matrices, thinking of them as vectors in R^{n²}. The set P = {X ∈ R^{n×n} | X ⪰ 0, Tr(X) ≤ 1} is the set of all positive semidefinite matrices with trace bounded by 1. The Plotkin-Shmoys-Tardos framework (see Section 3.3) is suitable for approximating such SDPs since all constraints are linear, and the oracle given in (3.7) can be implemented efficiently by an eigenvector computation. To see this, note that the oracle needs to decide, given a probability distribution p on the constraints, if there exists an X ∈ P such that ∑_j p_j A_j • X ≥ ∑_j p_j b_j. This can be implemented by solving the following optimization problem:

max { ∑_j p_j A_j • X : X ∈ P }.

It is easily checked that an optimal solution to the above optimization problem is given by the matrix X = vv^⊤, where v is a unit eigenvector corresponding to the largest eigenvalue of the matrix ∑_j p_j A_j.
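The oracle's eigenvector step can be sketched as follows (an illustrative snippet, not code from the paper; it assumes the top eigenvalue of ∑_j p_j A_j is positive — otherwise X = 0 would be the maximizer over P — and the example data at the end is our own):

```python
import numpy as np

def oracle(As, b, p):
    """Given a distribution p over constraints A_j . X >= b_j, check whether some
    X in P = {X PSD, Tr(X) <= 1} satisfies sum_j p_j (A_j . X) >= sum_j p_j b_j.
    The maximum of C . X over P (when C has a positive eigenvalue) is attained at
    X = v v^T, with v a top unit eigenvector of C = sum_j p_j A_j."""
    C = sum(pj * Aj for pj, Aj in zip(p, As))
    eigvals, eigvecs = np.linalg.eigh(C)
    v = eigvecs[:, -1]                    # unit eigenvector of the largest eigenvalue
    X = np.outer(v, v)                    # PSD with Tr(X) = 1
    value = float((C * X).sum())          # equals the largest eigenvalue of C
    return X, value >= float(np.dot(p, b))

# Example with two random symmetric constraint matrices (illustrative data).
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); A = (A + A.T) / 2
B = rng.standard_normal((4, 4)); B = (B + B.T) / 2
X, feasible = oracle([A, B], b=np.zeros(2), p=np.array([0.5, 0.5]))
```

The single eigenvector computation replaces a full SDP solve in each round, which is the source of the framework's efficiency.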
The Klein-Lu approach was of limited use in many cases because it does not do well when the additive error ε is required to be small. (They were interested in the MAX-CUT problem, where this issue does not arise. The reason in a nutshell is that in a graph with m edges, the maximum cut has at least m/2 edges, so it suffices to compute the optimum to an additive error ε that is a fixed constant.) We have managed to extend the multiplicative weights framework to many of these settings to design efficient algorithms for SDP relaxations of many other problems. The main idea is to apply the Multiplicative Weights framework in a "nested" fashion: one can solve a constrained optimization problem by invoking the MW algorithm on a subset of the constraints (the "outer" constraints) in the manner of Section 3.3, where the domain is now defined by the rest of the constraints (the "inner" constraints). The oracle can then be implemented by another application of the MW algorithm on the inner constraints. Alternatively, we can reduce the dependence on the width by using the observation that the Lagrangian relaxation problem on the inner constraints can be solved by the ellipsoid method. For details of this method, refer to our paper [3]. For several families of SDPs we obtain the best running times known.
More recently Arora and Kale [5] have designed a new approach for solving SDPs that involves a variant of the multiplicative update rule at the matrix level; see Section 5 for details.
Yet another approach for approximately solving SDPs is by reducing the SDP into a maximization problem of a single concave function over the PSD cone.The latter problem can be approximated efficiently via an iterative greedy method.The resulting algorithm is extremely similar to the MW-based algorithm of [3], however its analysis is very different, see [36] for more details.

Approximating graph separators
A recent application of the multiplicative weights method is a combinatorial algorithm for approximating the SPARSEST CUT of graphs [4]. This is a fundamental graph partitioning problem. Given a graph G = (V, E), the expansion of a cut (S, S̄), where S ⊆ V and S̄ = V \ S, is defined to be

Φ(S) = |E(S, S̄)| / min(|S|, |S̄|).

Here, E(S, S̄) is the set of edges with one endpoint in S and the other in S̄. The SPARSEST CUT problem is to find the cut in the input graph of minimum expansion. This problem arises as a useful subroutine in many other algorithms, such as divide-and-conquer algorithms for optimization problems on graphs, layout problems, clustering, etc. Furthermore, the expansion of a graph is a very useful way to quantify its connectivity and has many important applications in computer science. The work of Arora, Rao and Vazirani [7] gave the first O(√(log n)) approximation algorithm for the SPARSEST CUT problem. However, their best algorithm relies on solving an SDP and runs in Õ(n^4.5) time. They also gave an alternative algorithm based on the notion of expander flows, which are multicommodity flows in the graph whose demand graph has high expansion. However, that algorithm was based on the ellipsoid method, and was thus quite inefficient. In the paper [4], we obtained a much more efficient algorithm for approximating the SPARSEST CUT problem to an O(√(log n)) factor in Õ(n²) time using the expander flow idea. The algorithm casts the problem of routing an expander flow in the graph as a linear program, and then checks the feasibility of the linear program using the techniques described in Section 3.3. The oracle for this purpose is implemented using a variety of techniques: the multicommodity flow algorithm of Garg and Könemann [29, 30] (and its subsequent improvement by Fleischer [24]), eigenvalue computations, and the graph sparsification algorithms of Benczúr and Karger [9] based on random sampling.
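For concreteness, expansion can be computed directly from the definition (a small illustrative helper, not part of the algorithm in [4]):

```python
def expansion(edges, S, n):
    """Expansion of the cut (S, V \\ S) in a graph on vertices 0..n-1:
    |E(S, S-bar)| / min(|S|, |S-bar|)."""
    S = set(S)
    crossing = sum(1 for u, v in edges if (u in S) != (v in S))
    return crossing / min(len(S), n - len(S))

# A 4-cycle: every balanced cut along the cycle crosses exactly 2 edges.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

For the 4-cycle, the cut {0, 1} has expansion 2/2 = 1, while the "odd/even" cut {0, 2} crosses all four edges and has expansion 4/2 = 2; the sparsest cut is the contiguous one.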

Multiplicative weight algorithms in geometry
The multiplicative weight idea has been used several times in computational geometry.Chazelle [19] (p. 6) describes the main idea, which is essentially the connection between derandomization of Chernoff bound arguments and the exponential potential function noted in Section 3.5.
The geometric applications consist of derandomizing a natural randomized algorithm by using a deterministic construction of some kind of small set cover or low-discrepancy set.Formally, the analysis is similar to our analysis of the Set Cover algorithm in Section 3.5.Clarkson used this idea to give a deterministic algorithm for Linear Programming [20,21].Following Clarkson, Brönnimann and Goodrich use similar methods to find Set Covers for hypergraphs with small VC dimension [11].
The MW algorithm was also used in the context of geometric embeddings of finite metric spaces, specifically, embedding negative-type metrics (i.e., sets of points in Euclidean space such that the squared Euclidean distances between them also form a metric) into ℓ_1. Such embeddings are important for approximating the non-uniform SPARSEST CUT problem.
The approximation algorithm for SPARSEST CUT in Arora et al. [7] involves a "Structure Theorem." This structure theorem was interpreted by Chawla et al. [18] as saying that any n-point negative-type metric with maximum distance O(1) and average distance Ω(1) can be embedded into ℓ_1 such that the average ℓ_1 distance is Ω(1/√(log n)). They then used the MW algorithm to construct an embedding into ℓ_1 in which every pair of points at negative-type distance Ω(1) has ℓ_1 distance off by at most an O(√(log n)) factor from the original. Using a similar idea for other distance scales and combining the resulting embeddings, they obtained an embedding of the negative-type metric into ℓ_1 in which every distance is distorted by at most a factor of O(log^{3/4} n). Arora et al. [6] gave a more complicated construction improving the distortion bound to O(√(log n) log log n), leading to an O(√(log n) log log n)-approximation for non-uniform SPARSEST CUT.

Design of competitive online algorithms
Starting with the work of Alon, Awerbuch, Azar, Buchbinder and Naor [1], a number of competitive online algorithms have been developed using an elegant primal-dual approach which involves multiplicative weight updates.While the analysis of their algorithms seems to be beyond our general framework, we briefly mention this work without going into many details.We refer the readers to the survey by Buchbinder and Naor [14] for an extensive discussion of the topic.
Several online problems such as the ski rental problem, caching, load balancing, ad auctions, etc. can be cast (in their fractional form) as a linear program with non-negative coefficients in the constraints and cost, where either the constraints or variables arrive online one by one.The online problem is to maintain a feasible solution at all times with a bounded competitive ratio, i. e., ensuring that the cost of the solution maintained is bounded in terms of the cost of the optimal solution in each round.The main difficulty comes from the requirement that the solution maintained is monotonic in some sense (for example, the variables are never allowed to decrease).
Buchbinder and Naor [15] give an algorithm based on the primal-dual method that obtains good competitive ratios in this scenario. At its heart is a multiplicative weight update rule. Imagine we have a covering LP, i.e., all constraints are of the form a • x ≥ 1, where a has non-negative coefficients and x is the vector of variables. The cost function has non-negative coefficients as well. Constraints arrive one at a time and each must be satisfied by the current solution from the round it arrives onwards. The requirement is that no variable can decrease from round to round.
In every round, the algorithm increases primal variables in the new constraint using multiplicative updates and the corresponding dual variable additively (so essentially, each primal variable is the exponential of the dual constraint that it corresponds to, much like the Plotkin-Shmoys-Tardos algorithms of Section 3.3).This is done until the constraint gets satisfied.Clearly, we maintain a feasible solution in each round, and the variables never decrease.The analysis goes by bounding the increase in the primal cost in terms of the dual cost (this is an easy consequence of the multiplicative update, in fact, the multiplicative update can be derived using this requirement).The competitive ratio is obtained by showing that the dual solution generated simultaneously, while infeasible, is not far from being feasible, i. e., scaling down the variables by a certain factor makes it feasible.This gives us a bound on the competitive ratio via weak duality.
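The primal side of this scheme can be sketched as a fractional-covering loop (our own simplified variant: the increment rule and the parameter `delta` are assumptions for illustration, and no competitive-ratio claim is made; the sketch only exhibits the two invariants discussed above, monotone variables and per-round feasibility):

```python
def online_covering(constraint_stream, c, delta=0.01):
    """Maintain x >= 0 for min c.x subject to covering constraints a.x >= 1
    arriving online; variables only ever increase."""
    n = len(c)
    x = [0.0] * n
    history = []
    for a in constraint_stream:
        # Raise the variables appearing in the new constraint multiplicatively
        # (with a small additive seed to get off zero) until it is satisfied.
        while sum(ai * xi for ai, xi in zip(a, x)) < 1.0:
            for j in range(n):
                if a[j] > 0:
                    x[j] = x[j] * (1.0 + delta * a[j] / c[j]) + delta * a[j] / (c[j] * n)
        history.append(list(x))
    return history

# Three covering constraints arriving one per round, with costs c.
constraints = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
hist = online_covering(constraints, c=[1.0, 2.0, 1.0])
```

Cheaper variables (smaller c_j) are raised faster, which is the mechanism behind the primal-dual charging argument in the analysis.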

Lower bound
Can our analysis of the Multiplicative Weights algorithm be improved?This section shows that the answer is no, not only for the MW algorithm itself, but for any algorithm which operates in the online setting we have described.A similar lower bound was obtained by Klein and Young [48].
Technically, we prove the following:

Theorem 4.1. For any online decision-making algorithm with n ≥ 2 decisions, there exists a sequence of cost vectors m^(1), m^(2), . . ., m^(T) ∈ {0, 1}^n such that min_i ∑_t m_i^(t) = Ω(T), and if the sequence of distributions over decisions produced by the algorithm is p^(1), p^(2), . . ., p^(T), then we have

∑_{t=1}^T m^(t) · p^(t) − min_i ∑_{t=1}^T m_i^(t) = Ω(√(T ln n)).

Since in the theorem above we have min_i ∑_{t=1}^T m_i^(t) = Ω(T), Theorem 2.1 implies that by choosing the optimal value of η, viz. η = Θ(√(log n / T)), we have

∑_{t=1}^T m^(t) · p^(t) ≤ min_i ∑_{t=1}^T m_i^(t) + O(√(T log n)).

Hence our analysis is tight up to constants in the additive error term. Moreover, the above lower bound applies to any algorithm, efficient or not.
Proof.The proof is via the probabilistic method.We construct a distribution over the costs so that the required bound is obtained in expectation.Interestingly, the distribution is independent of the actual algorithm used.We now specify the costs of the decisions.The cost of decision 1 is set to 1/2 for all rounds t.For any decision i > 1, we construct its cost via the following random process: in each iteration t, choose its cost to be either one or zero uniformly at random, i. e., m i (t) ∈ {0, 1} with probability of each outcome being 1/2.
The expected cost of each decision in each round is 1/2. Hence, the expected cost of the chosen decision is also 1/2 irrespective of the algorithm's distribution p^(t), and hence

E[∑_{t=1}^T m^(t) · p^(t)] = T/2.

For every decision i, define X_i = ∑_{t=1}^T m_i^(t). Note that X_1 = T/2, and for i > 1, we have X_i ∼ B(T, 1/2), where B(T, 1/2) is the binomial distribution with T trials and both outcomes equally likely.
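The random costs in this construction are easy to simulate; the following sketch (our own, with assumed values of n and T) exhibits the gap between the per-round average cost T/2 and the best decision in hindsight, which is on the order of √(T ln n):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 100, 10_000

# Decision 1 costs 1/2 every round; decisions 2..n cost 0 or 1 uniformly at random.
X = np.empty(n)
X[0] = T / 2
X[1:] = rng.integers(0, 2, size=(n - 1, T)).sum(axis=1)

# Any algorithm pays T/2 in expectation, so this is the regret it is forced to incur.
gap = T / 2 - X.min()
```

Since every algorithm's expected total cost is exactly T/2, the gap computed here lower-bounds the expected regret of any algorithm on this cost distribution.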

The Matrix Multiplicative Weights algorithm
In the preceding sections, we considered an online decision making problem.We refer to that setting as the basic or scalar case.Now we briefly consider a different online decision making problem, which is seemingly quite different from the previous one, but has enough structure that we can obtain an analogous algorithm for it.We move from cost vectors to cost matrices, and from probability vectors to density matrices.For this reason, we refer to the current setting as the matrix case.We call the algorithm presented in this setting the Matrix Multiplicative Weights algorithm.The original motivation for our interest in this matrix setting is that it leads to a constructive version of SDP duality, just as the standard multiplicative weights algorithm can be viewed as a constructive version of LP duality.In fact the standard algorithm is a special subcase of the algorithm in this section, namely, when all the matrices involved are diagonal.Applications of the matrix multiplicative weights algorithm include solving SDPs [5], derandomizing constructions of expander graphs, and obtaining bounds on the sample complexity for a learning problem in quantum computing.The celebrated result of Jain et al. [41] showing that QIP =PSPACE relied on the Matrix MW algorithm.Here QIP is the set of all languages which have quantum interactive proofs.These applications are unfortunately beyond the scope of this survey; please see the third author's Ph.D. thesis [44] and Jain et al. [41] for details.The algorithm given here is from a paper of Arora and Kale [5].A very similar algorithm was discovered independently slightly earlier by Warmuth and Kuzmin [63], and is based on the even earlier work of Tsuda, Rätsch, and Warmuth [62].
We stick with our basic decision-making scenario, but decisions now correspond to unit vectors v in S^{n−1}, the unit sphere in R^n. As in the basic case, in every round our task is to pick a decision v ∈ S^{n−1}. At this point, the costs of all decisions are revealed by nature. These costs are not arbitrary, but are correlated in the following way: a cost matrix M ∈ R^{n×n} is revealed, and the cost of a decision v is then v^⊤Mv. We assume that the costs of all decisions lie in [−1, 1]. Again, as in the basic case, this is the only assumption we make on the way nature chooses the costs; indeed, the costs could even be chosen adversarially. Equivalently, we assume that all the eigenvalues of the matrix M are in the range [−1, 1]. This game is repeated over a number of rounds; let t = 1, 2, . . ., T denote the current round. In each round t, we select a distribution D^(t) over the set of decisions S^{n−1}, and select a decision v randomly from it. At this point, the costs of all decisions are revealed by nature via the cost matrix M^(t). The expected cost to the algorithm for choosing the distribution D^(t) is

E_{v∈D^(t)}[v^⊤ M^(t) v] = M^(t) • E_{v∈D^(t)}[vv^⊤].
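The matrix analogue of the Hedge distribution maintains a density matrix proportional to the matrix exponential of the accumulated cost matrices; when all the M^(t) are diagonal, this collapses to the scalar Hedge weights. A sketch (our illustration, not the paper's pseudocode; the example cost matrices and η are assumed):

```python
import numpy as np

def mmw_density(cost_matrices, eta):
    """Density matrix P proportional to exp(-eta * sum_t M^(t)), computed by
    eigendecomposition. P is PSD with trace 1: the matrix analogue of Hedge."""
    S = sum(cost_matrices)
    eigvals, U = np.linalg.eigh(S)                 # S is symmetric
    E = (U * np.exp(-eta * eigvals)) @ U.T         # U diag(e^{-eta*lambda}) U^T
    return E / np.trace(E)

# Diagonal cost matrices reduce to the scalar Hedge weights on the diagonal.
costs = [np.diag([0.1, 0.5, 0.9]), np.diag([0.3, 0.2, 0.4])]
P = mmw_density(costs, eta=0.5)
```

For non-commuting symmetric cost matrices the same formula applies, with the eigendecomposition of the sum replacing the coordinate-wise exponential.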

Figure 2: The Multiplicative Weights algorithm with Restricted Distributions.
Suppose the optimum flow assigns f_p^opt flow to path p, and let F^opt = ∑_p f_p^opt be the total flow. For any set of edge weights w_e, the shortest path p ∈ P with edge lengths w_e/c_e satisfies

∑_{e∈p} w_e/c_e ≤ (∑_e w_e) / F^opt.

THEORY OF COMPUTING, Volume 8 (2012), pp. 121-164

Lemma 3.10. Let w_1, . . ., w_n be arbitrary non-negative real numbers, and let s_1, . . ., s_n be independent exponential random variables with mean 1. Then

Pr[∀i ≠ j : w_j/s_j ≥ w_i/s_i] = w_j / ∑_i w_i.

Proof. The probability density function of s_i is e^{−s_i}. Conditioning on the value of s_j, for a particular i ≠ j we have

Pr[w_j/s_j ≥ w_i/s_i | s_j] = Pr[s_i ≥ (w_i/w_j) s_j | s_j] = e^{−(w_i/w_j) s_j}.

Since all the variables s_1, s_2, . . ., s_n are independent, we have

Pr[∀i ≠ j : w_j/s_j ≥ w_i/s_i | s_j] = ∏_{i≠j} e^{−(w_i/w_j) s_j}.

Integrating to remove the conditioning on s_j, we have

Pr[∀i ≠ j : w_j/s_j ≥ w_i/s_i] = ∫_0^∞ e^{−s_j} ∏_{i≠j} e^{−(w_i/w_j) s_j} ds_j = ∫_0^∞ e^{−(∑_i w_i/w_j) s_j} ds_j = w_j / ∑_i w_i.

The proof of Theorem 4.1 is completed as follows. For t = (1/4)√(ln(n−1) T) we have

Pr[∀i > 1 : X_i > T/2 − t] = ∏_{i>1} Pr[X_i > T/2 − t] ≤ (1 − (1/15) e^{−16t²/T})^{n−1} ≤ e^{−1/15} < 0.95,

where the equality is by the independence of the X_i and the first inequality is by Lemma 4.2. Thus, with probability at least 1 − 0.95 = 0.05, we have min_i X_i ≤ T/2 − t. Since min_i X_i is always at most T/2, this gives E[T/2 − min_i X_i] ≥ 0.05t ≥ (1/160)√(ln(n−1) T) = Ω(√(T ln n)); and since ∑_{t=1}^T m^(t) · p^(t) − min_i ∑_{t=1}^T m_i^(t) is always at most T, Markov's inequality converts this bound on the expectation into one that holds with constant probability. Finally, min_i ∑_t m_i^(t) ≥ T/4, i.e., Ω(T), except with probability at most n·exp(−T/32), by a standard Chernoff bound.

SATYEN KALE received a B.Tech. degree from IIT Bombay and a Ph.D. from Princeton University in 2007. His Ph.D. advisor was Sanjeev Arora. His thesis focused on the topic of this paper: the Multiplicative Weights Update algorithm and its applications. After postdocs at Microsoft Research and Yahoo! Research, he continued his conquest of industrial research labs, joining IBM Research in 2011. His current research is the design of efficient and practical algorithms for fundamental problems in Machine Learning and Optimization.