Testing Ranking System Performance
Whether we like it or not, our World Ranking system is becoming more and more integrated into our croquet lives. It is not surprising that questions about its performance repeatedly surface in discussions on the Nottingham Board. Will Step size 60 perform better than 40? Will a larger Smoothing parameter give better performance? Such questions remain largely unanswered. One reason is that there are no known criteria for what "better performance" means.
The purpose of this article is to initiate the development of performance testing. It involves the following steps.
1. Setting up a Test Population
2. Setting up a Trial System (as rival for the Current System)
3. Setting up a Test
4. Studying the Test Results
We illustrate the procedure by carrying out these steps one by one. Here is a quick peek at what lies ahead.
Our first illustrative Test uses a Test Population in which every player starts with a randomly selected Grade and Index within 200 of the true value. The idea is to see which system recovers better from such a deliberately bad start, one that gave a Standard Deviation of about 115. At the end of the experiment, the Current System had improved the quality of the Grades and Indexes quite nicely. For the 30+ population (i.e. players with at least 30 games played) it yielded Standard Deviations of 63.0 for Grades and 75.3 for Indexes, well down from the 115 at the start. The Trial System under scrutiny did even better: 59.4 for Grades and 60.2 for Indexes. Note that the Indexes of the Trial System outperformed even the Grades of the Current System.
However, that is not the end of the story. In a similar second experiment, in which starting Grades and Indexes need only be within 400 of the true value, the Current System flexed some muscle. Its Grades performed better than those of the Trial System; its Indexes did so only where few games were played.
It seems fair to say that the Testing Procedure envisaged will not end all arguments. Hopefully, the arguments that remain will at least become more focused and better informed. For the experiments mentioned, for example, the argument might focus on whether the system normally operates within 200 of the target or whether the wider margin of 400 is the more realistic one, and, of course, on what supplementary Tests are needed to arrive at a well-founded decision.
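The starting Standard Deviation of about 115 in the first Test can be checked with a small simulation. The sketch below is only an illustration and rests on assumptions that this article does not state: that the starting errors are drawn uniformly from the interval -200 to +200, and that the Standard Deviation is measured as the root-mean-square deviation from the true value. The function names, the true value of 2000 and the population size are all hypothetical. Under these assumptions the expected figure is 200 divided by the square root of 3, which is about 115.5.

```python
import math
import random


def make_test_population(true_values, max_error=200.0, seed=0):
    """Return starting values, each within max_error of its true value.

    Assumption: errors are uniform on [-max_error, +max_error].
    """
    rng = random.Random(seed)
    return [t + rng.uniform(-max_error, max_error) for t in true_values]


def rms_error(estimates, true_values):
    """Root-mean-square deviation of the estimates from the true values."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical population: 100,000 players who all have a true value of 2000.
true_values = [2000.0] * 100_000
start = make_test_population(true_values, max_error=200.0)

simulated = rms_error(start, true_values)
theoretical = 200.0 / math.sqrt(3)

print(f"simulated starting deviation:   {simulated:.1f}")
print(f"theoretical starting deviation: {theoretical:.1f}")  # 115.5
```

The same `rms_error` function could be applied to the Grades or Indexes at the end of a Test, restricted to players with at least 30 games, to produce figures of the kind quoted above. Doubling `max_error` to 400, as in the second experiment, doubles the expected starting deviation to about 231 under the same uniform assumption.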