"A generalization of the Bradley-Terry model for draws in chess"

JEBO-D-20-00916

Referee Report


The referee's comments are preceded by a chevron (">") and my response
is included inline.  I have changed line-breaks in the interests of
clarity.

Overall Response

I am delighted that this insightful referee "gets" the manuscript.
The referee makes a number of observations and suggestions, all of
which I agree with, and I have made changes to the submission in
response, colour coded for ease of reference.  I have tried very hard
to preserve the brevity and pleasing structure that the referee
explicitly notes.  In short, the revised manuscript is a stronger,
better argued, and more widely applicable document and I commend it to
you.


Software implementation note

In the interests of full disclosure, I have made a significant change
to the C++ back end of the hyper2 package:

https://github.com/RobinHankin/hyper2/commit/d9d16c45daa752d4687446d7df9805439f461451#diff-25d902c24283ab8cfbac54dfa101ad31


The hyper2 package now uses player names as keys in the STL map class,
rather than integers.  The change is a conceptual streamlining which
results in simplified R idiom for a number of cases, but does not
change the mathematical basis of the software or alter any numerical
results.


Detailed responses


> This paper develops extensions to the Bradley-Terry model appropriate
> to Bernoulli trials so that the model can accommodate the
> chess-specific needs of draws and the first-mover advantage possessed
> by White.  The author takes this extended model to various sets of
> exclusively game-level outcomes and estimates time-invariant player
> strengths as well as the importance of draws and first-mover
> advantage.  After estimating common parameters for draws and
> first-mover advantage, the author allows for (time-invariant)
> player-specific parameters on these margins.  The paper then concludes
> by considering distinct draw parameters for Soviet and non-Soviet
> players.  Using data from the top chess events of 1962 and 1963, the
> author determines that Soviet players draw more frequently against
> each other than in other game scenarios.
> 
> The paper does a nice job of clearly laying out the various reified
> "monsters" that extend the original Bradley-Terry model to accommodate
> chess.  This framework is plausible and intuitive.  I also appreciated
> the paper's brevity in making its point. The allowance for
> playerspecific tendencies to draw or to gain excess-advantage from
> moving first also strikes me as an innovation to the literature.
> Lastly, the paper is well-written and a fun read.
> 
> 
> My comments on the rest of the paper are detailed below by decreasing
> importance.
> 
> 1) The author never properly explores the paper's identification
> strategy.  As I read it, the identifying assumption, besides the
> functional form assumption, is that player strengths are constant over
> the duration of the considered data. This seems highly plausible
> within a specific event (e.g., the Candidates tournament of 1963, the
> Interzonals of 1962), but much less so with the Anand-Karpov-Kasparov
> example.  The author does not specify the timespan over which these
> 365 games were played, but I imagine that the timespan must exceed 10
> years.  To the extent that all players have matured to full strength
> and none have yet begun to decline, this assumption would be benign,
> but it should be supported.


An excellent observation.  In the chess world the pre-eminent strength
measure would be the Elo rating in which the winning player takes
points from the loser based on the pre-match Elo rating difference.
Elo exhibits an emergent phenomenon in which sudden changes of ability
are not reflected instantly but have an exponentially damped response.

My view is that this damping feature is sometimes desirable (old games
are surely less informative than more recent ones) but uncontrolled in
the sense that the time-constant is not explicit or controllable.
Also, the Elo system uses an arbitrary points system with arbitrary
cut-offs [as opposed to the objective Bradley-Terry system] and has
the undesirable feature that changing the points system or the
cut-offs will change players' Elo ratings even if their playing
performance is unchanged.

But frankly I don't want to open the can of worms that would be
comparison between Elo and Bradley-Terry.  I had a protracted argument
on this issue with a chess enthusiast; let us say the discussion was
intense and robust.  The problem is that Elo was designed specifically
to downweight old data and this downweighting is what users of such
metrics want.  Unfortunately, what my correspondent wanted from
Bradley-Terry strengths was pretty much Elo with all its
arbitrariness: a somewhat circular position.

> 
> Precise estimation will hinge on a large number of games, but this
> will be in tension with the assumption of time-invariant
> player-strengths.  The author should explore this tension.
> 

I am in total agreement with this observation.  My thinking on this
issue has been strongly affected by my recent work applying
Bradley-Terry to Formula 1 motor racing [I have a manuscript under
review at JQAS].  In F1, drivers generally start their careers as weak
competitors, become stronger with experience, and finally become
uncompetitive due to age [although frankly Lewis Hamilton at 35yo
seems to buck this general trend, as recent events in Tuscany would
indicate].  I would suggest that chess players have a similar career
arc, although Formula 1 Teams typically take a dim view of
underperforming drivers---whom they rapidly and unceremonially drop.
The career arc for these competitors would have a truncated right hand
side, unlike chess players who typically decline gradually over
decades [although I would observe that Kasparov's current FIDE rating
of 2812 is not that much less than his peak of 2851].

Specifically, the dataset in question stretches from 1981 to 2017, a
very wide range.  Generalized Bradley-Terry, together with reified
entities, is capable of handling strengths varying over a career: one
would introduce a "mid-career monster", or perhaps an "experience
monster".  I would add here that monsters' strengths are inherently
positive, so one could deal with waning strength with age by
introducing a "youth monster" who helps younger players.  But such
analysis is, I would suggest, firmly in the category of further work
(although the revision does address a time-dependent component in the
form of a "rest monster").  I have added a discussion of the
identification issue to the manuscript (RED), in two parts: firstly
when introducing the dataset, and secondly under Further Work.


> 2) The author neglects to motivate the paper beyond its application
> within the chess world.  The tournament literature's applications lie
> primarily, though definitely not exclusively, within compensation and
> labor.  How could this extension be used within that context?  Where
> might economists see the first-mover advantage in action? Where might
> the extension to allow draws be especially lucrative? Etc.
>

(manuscript changes addressing this comment are marked in ORANGE).

The submission discusses applications to MasterChef Australia,
consumer behaviour and mating preference, but I was not aware of
tournament literature as a distinct academic field.  I have been
studying Lazear and Rosen, and Szymanksi 2003, and pondering how
reified monsters could help understanding tournaments in a workplace
remuneration context.  Lazear and Rosen supply a good rationale to
consider only rank in situations such as Formula 1 motor racing.

First-mover advantage is closely related to incumbent advantage,
something which had not occurred to me until now.  One of the papers
cited [Hankin 2010] uses a variant of Bradley-Terry to assess
incumbent advantage in biology (territory defense by lizards) and I
have added a more explicit pointer to this.  One point of contact with
tournament literature might be incumbent advantage possessed in a
remuneration tournament by candidates who are working temporarily in
the capacity being fought over, such as an 'acting CEO' or 'acting
head of department'.  The appropriate incumbent monster would furnish
a quantification of the incumbent advantage.

Returning to Lazear and Rosen, schemes that reward rank are common in
sports (notably Formula 1, discussed above, and sculling, both good
applications of the Plackett-Luce extension of Bradley-Terry).

Also, I have discussed Formula 1 motor racing above under part 1 of
the review.


> 3) As the author notes, the work of Moul and Nye (2009) is directly on
> point.  The author's discussion of this paper, however, is overly
> sparse and at times misleading.  Moul and Nye use existing observed
> player strengths (variations of commonly used scores based on the Elo
> rating system that goes back to 1960).  Moul and Nye then use these
> strengths within an ordered probit to estimate parameters that relate
> back to draw likelihood and White's first-mover advantage.  The sharp
> difference between that paper and this is then that the author is
> estimating player relative strengths (conditional upon a pool of
> players) while Moul and Nye use "observed" strengths [footnote: Those
> relative strengths are highly predictive, with t-statistics of around
> 20].  The latter approach is far less parametrically demanding, but
> leans much more on the assumption of normally distributed disturbances
> (basically the differing assumption of functional form).  The latter
> approach also allows individuals' player strength to vary across time
> and is far more flexible to changes in the composition of the pool of
> players.
> 
> Furthermore, Moul and Nye estimate several versions of their ordered
> probit without data on the number of moves (ply-count).  This is
> contrary to the implication of the paper (p. 6, immediately below
> Table 2), which suggests that the ply-count data are essential to that
> paper's findings.  All of that paper's core results (significance of
> draws, first-mover advantage, a distinct draw outcome for round-robin
> international tournament games that involve two Soviets) are
> originally derived without ply-count data.  The ply-count data are
> incorporated to explore a potential mechanism by which players in a
> draw-cartel might benefit.
>


This interpretation of Moul and Nye is very close to my own, and
re-reading my work I realise that the sentence in question was poorly
phrased and indeed clearly (but unintentionally) misleading.  I have
rephrased the sentence in question, and added a few more points of
reference to Moul and Nye, in BLUE.


> 4) The paper concludes that the estimates derived from the 1962
> Interzonals and the 1963 Candidates strongly suggest that the "draw
> monster" won games involving two Soviet players more frequently than
> games that involved at least one non-Soviet.  This is consistent with
> a Soviet cartel, but it could also be the result of Soviet-specific
> styles of play.  Moul and Nye (2009) test the Soviet style-of-play
> hypothesis by considering international tournaments when the format
> changed from all-play-all to knockout and all Soviet tournaments.
> Both comparisons yield results consistent with a Soviet draw-cartel in
> international all-play-all tournaments and rejecting the Soviet
> style-of-play hypothesis.  By only considering tournaments that were
> international and all-play-all, the author forgoes this critical
> source of identification.
> 
> In the cartel example, the author also neglects the advantage that
> presumably accrued to Soviet players when they played non-Soviet
> players in all-play-all international tournaments.  This would suggest
> another "monster" to be estimated, one that benefits the Soviet player
> when playing a non-Soviet opponent.  Without such an addition, Soviet
> strengths will be overstated.
> 

(comments addressing this comment in DARK GREEN) This is an excellent
suggestion, one which I feel nicely rounds out the analysis.  I have
included a "rest monster" whose strength aids players following a
collusive draw.  Reified Bradley-Terry allows one not only to
establish the existence of such a rest monster, but also to quantify
its strength and I have done just this in the revision, and compared
the findings with a classical McNemar's test.




