Better Rankings with Fewer Comparisons: Multi-Armed Bandits for Efficient Ordering

How compere uses MAB algorithms to rank items effectively with minimal pairwise feedback — applications in search and recommendation.

The Ranking Problem

You have 1,000 items. You need to rank them from best to worst. The only way to assess quality is to show two items to a human evaluator and ask which one is better. Each comparison costs time and money. How many comparisons do you need?

A naive approach that compares every pair requires 1,000 × 999 / 2 = 499,500 comparisons, which is obviously impractical. A comparison-based sort such as merge sort needs on the order of n log₂ n, or roughly 10,000 comparisons for 1,000 items: better, but still expensive if each comparison costs a dollar or takes five minutes of expert time. And both estimates assume that human judgments are perfectly consistent, which they are not. The same evaluator shown the same pair twice might give different answers 10-20% of the time.
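
The arithmetic behind these estimates is easy to reproduce (the 10-20% inconsistency rate is an empirical figure, not something computed here):

```python
import math

n = 1000

# Exhaustive pairwise comparison: every unordered pair exactly once.
exhaustive = n * (n - 1) // 2

# A comparison-based sort needs on the order of n * log2(n) comparisons.
merge_sort = round(n * math.log2(n))

print(exhaustive)  # 499500
print(merge_sort)  # 9966
```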

This problem appears everywhere. Search engines need to rank results by relevance. Recommendation systems need to order items by predicted preference. Hiring committees need to rank candidates. Academic conferences need to rank paper submissions. Tournament organisers need to seed competitors. In all of these settings, the fundamental challenge is the same: produce a good ranking with as few comparisons as possible, given that comparisons are noisy.

This is the problem that compere addresses.

Why Sorting Algorithms Fall Short

Traditional sorting algorithms assume two things that do not hold in practice. First, they assume that comparisons are deterministic — that asking “is A better than B?” always yields the same answer. Human judgments violate this assumption routinely. Second, they assume that you need a complete ranking — that the position of every single item matters equally. In practice, getting the top 10 right is usually far more important than correctly ordering items 487 and 488.

These two observations point toward a different algorithmic framework. We need an approach that handles noisy comparisons gracefully, that focuses its effort where it matters most, and that produces a good-enough ranking as quickly as possible rather than a perfect ranking eventually.

Multi-armed bandits provide exactly this framework.

The Multi-Armed Bandit Connection

The multi-armed bandit (MAB) problem is a classic framework in sequential decision-making. You have several slot machines (arms), each with an unknown payout distribution. At each time step, you pull one arm and observe the reward. Your goal is to maximise total reward over time, which requires balancing exploration (trying arms to learn their payouts) with exploitation (pulling the arm you currently believe is best).
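
The classic setup fits in a few lines. The sketch below is a generic epsilon-greedy strategy, not compere's algorithm, and the payout probabilities are invented for illustration:

```python
import random

random.seed(0)

true_payouts = [0.2, 0.5, 0.8]   # unknown to the agent
counts = [0] * 3                 # pulls per arm
totals = [0.0] * 3               # summed rewards per arm
epsilon = 0.1                    # fraction of pulls spent exploring

for _ in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)  # explore: pick a random arm
    else:
        # Exploit: pick the arm with the best empirical mean so far.
        means = [totals[i] / counts[i] if counts[i] else 0.0 for i in range(3)]
        arm = max(range(3), key=means.__getitem__)
    reward = 1.0 if random.random() < true_payouts[arm] else 0.0
    counts[arm] += 1
    totals[arm] += reward

best = max(range(3), key=lambda i: totals[i] / counts[i])
print(best)  # the agent has identified arm 2 as the best payer
```

The exploration pulls are what keep the agent from locking onto an early, mediocre arm; the exploitation pulls are what make the strategy profitable once the estimates are good.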

The connection to ranking becomes clear when you reframe the problem. Instead of slot machines, you have items. Instead of pulling an arm, you select a pair of items to compare. Instead of observing a reward, you observe which item wins the comparison. Your goal is to identify the ranking as quickly as possible, which requires balancing exploration (comparing items you are uncertain about) with exploitation (focusing comparisons on the items whose positions you care about most).

This reframing transforms ranking from a sorting problem into a sequential experimental design problem. And sequential experimental design is exactly what bandit algorithms are built for.

How compere Works

compere implements a family of bandit-based ranking algorithms. The core idea is straightforward, even if the mathematics behind it is nontrivial.

Maintaining beliefs. For each item, compere maintains a probabilistic estimate of its quality. Initially, these estimates are uniform — every item could be anywhere in the ranking. As comparisons are observed, the estimates are updated using Bayesian inference. Items that win comparisons have their estimated quality pushed upward; items that lose have theirs pushed downward.
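
A drastically simplified stand-in for this update step: one scalar skill estimate per item, nudged after each comparison in proportion to how surprising the outcome was. This is an Elo-style point update rather than the full Bayesian posterior the text describes, and the items and learning rate are invented:

```python
import math

skills = {"A": 0.0, "B": 0.0, "C": 0.0}
K = 0.5  # learning rate

def observe(winner, loser):
    # Win probability implied by the current estimates
    # (a Bradley-Terry-style logistic model).
    p_win = 1.0 / (1.0 + math.exp(skills[loser] - skills[winner]))
    # Surprising wins move the estimates more than expected ones.
    skills[winner] += K * (1.0 - p_win)
    skills[loser] -= K * (1.0 - p_win)

for _ in range(20):
    observe("A", "B")   # A consistently beats B
    observe("B", "C")   # B consistently beats C

ranking = sorted(skills, key=skills.get, reverse=True)
print(ranking)  # ['A', 'B', 'C']
```

Note that the update is self-damping: once A's estimate is far above B's, further wins by A barely move either estimate, which is exactly the "additional information is unlikely to change anything" behaviour described below.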

Selecting comparisons. At each step, compere must choose which pair of items to compare next. This is where the bandit logic operates. A naive strategy would select pairs randomly. A greedy strategy would compare the items whose current estimates are closest (the most uncertain boundary). compere uses a strategy that considers both uncertainty and importance: it preferentially selects comparisons that are most likely to change the ranking in positions that matter.

For example, if your application cares primarily about the top 10, compere will focus comparisons on items that are near the 10th-position boundary. Items that are clearly in the top 5 or clearly outside the top 50 will receive fewer comparisons, because additional information about them is unlikely to change the decisions you care about.
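
A toy version of this selection rule, in the spirit described above but much cruder than compere's actual strategy: among items within a window of the top-k boundary, compare the adjacent pair whose current estimates are closest. The scores and window size are invented:

```python
def select_pair(estimates, k, window=3):
    # Rank items by current estimated quality, best first.
    ranked = sorted(estimates, key=estimates.get, reverse=True)
    # Candidates: items within `window` positions of the k-th boundary.
    lo, hi = max(0, k - window), min(len(ranked), k + window)
    candidates = ranked[lo:hi]
    # Choose the adjacent candidate pair with the smallest score gap
    # (the most uncertain ordering near the boundary).
    return min(
        zip(candidates, candidates[1:]),
        key=lambda pair: abs(estimates[pair[0]] - estimates[pair[1]]),
    )

estimates = {"a": 9.1, "b": 7.5, "c": 7.3, "d": 7.2, "e": 4.0, "f": 1.0}
print(select_pair(estimates, k=3, window=2))  # ('c', 'd')
```

Items "a" (clearly in) and "f" (clearly out) are never candidates: comparing them would not change who makes the top 3.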

Convergence. As comparisons accumulate, the probabilistic estimates narrow. Items settle into their positions. compere can report a ranking at any time, along with confidence intervals for each item’s position. The user decides when the ranking is good enough to stop — they are not forced to wait until some algorithm-determined termination condition.
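
One simple way to turn per-item quality beliefs into rank intervals, shown here as a Monte Carlo sketch with invented Gaussian beliefs (compere's internals may compute these differently): sample many plausible worlds from the beliefs and read off the distribution of each item's position.

```python
import random

random.seed(1)

# (mean, std) of each item's estimated quality; numbers are invented.
beliefs = {"A": (2.5, 0.3), "B": (1.5, 0.6), "C": (1.4, 0.6), "D": (0.0, 0.3)}

rank_samples = {name: [] for name in beliefs}
for _ in range(2000):
    # Draw one plausible world and rank the items within it.
    draw = {n: random.gauss(mu, sd) for n, (mu, sd) in beliefs.items()}
    ordered = sorted(draw, key=draw.get, reverse=True)
    for pos, name in enumerate(ordered, start=1):
        rank_samples[name].append(pos)

for name, ranks in rank_samples.items():
    ranks.sort()
    low, high = ranks[len(ranks) // 20], ranks[-len(ranks) // 20]
    print(f"{name}: 90% interval on rank = [{low}, {high}]")
```

A and D come out with tight intervals, while B and C, whose beliefs overlap heavily, get wide ones: the honest signal that a few more comparisons between them would be informative.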

The Explore-Exploit Trade-Off in Ranking

The explore-exploit trade-off manifests concretely in ranking. Consider a scenario where you have 500 items and care about the top 20.

Early on, compere is in exploration mode. It samples comparisons broadly across the population to get initial estimates for every item. Many items are quickly identified as clearly below the top-20 threshold, and compere effectively eliminates them from further consideration.

As the process continues, compere shifts toward exploitation. It concentrates comparisons on the 40-60 items that are plausible top-20 candidates. Within this group, it focuses on the most uncertain boundaries — the items hovering around positions 18-22, where a few more comparisons could change who makes the cut.

This adaptive allocation of comparisons is what makes compere dramatically more efficient than static approaches. In our benchmarks on synthetic datasets with known ground-truth rankings, compere identifies the correct top-20 with 95% accuracy using 60-70% fewer comparisons than merge sort with repeated comparisons, and 85-90% fewer than exhaustive pairwise comparison.

Handling Noisy Judgments

Human comparisons are noisy. The same evaluator might prefer item A over item B today and reverse their judgment tomorrow. Different evaluators might disagree. The strength of preference matters — choosing between two excellent items is harder and noisier than choosing between an excellent item and a terrible one.

compere models this noise explicitly. Each comparison is treated as a noisy observation drawn from a probabilistic model of comparative judgment (specifically, a variant of the Bradley-Terry model). The noise level is not assumed — it is estimated from the data. When compere detects high disagreement on a particular pair, it either requests additional comparisons for that pair or reduces its confidence in the relative ordering of those items.
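
A minimal sketch of disagreement detection, assuming a simple Beta-posterior model of a single pair's win rate (the threshold and counts are invented; compere's actual noise model is richer than this):

```python
import math

def pair_status(wins_a, wins_b, threshold=0.15):
    # Beta(1, 1) prior; posterior mean of P(A beats B).
    mean = (wins_a + 1) / (wins_a + wins_b + 2)
    # Posterior standard deviation of that win probability.
    n = wins_a + wins_b + 2
    std = math.sqrt(mean * (1 - mean) / (n + 1))
    # Flag the pair as noisy when the evidence sits near a coin flip.
    noisy = abs(mean - 0.5) < threshold
    return mean, std, noisy

print(pair_status(9, 1))   # clear preference for A: not flagged
print(pair_status(6, 5))   # near 50/50: flagged for more comparisons
```

The flagged pair is exactly the kind the text describes: either it receives additional comparisons, or the uncertainty in that local ordering is carried through to the final rank intervals.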

This noise-aware approach has a practical consequence: compere’s rankings come with honest uncertainty estimates. Rather than asserting that item A is ranked 7th, compere might report that item A is ranked 7th with a 90% confidence interval of [5, 11]. This is more informative than a point estimate, and it lets downstream decision-makers calibrate their actions to the actual precision of the ranking.

Applications

Search ranking evaluation. Search engines need to evaluate ranking quality, which requires human relevance judgments. A standard evaluation campaign might need relevance judgments for 10,000 query-document pairs. compere can identify the best ranking function with far fewer judgments by adaptively focusing evaluations on the queries and documents where current ranking functions disagree.

Recommendation system tuning. Recommendation algorithms have many hyperparameters. Evaluating each configuration requires user feedback, which is expensive. compere can treat each configuration as an item to be ranked and efficiently identify the best configuration with minimal user testing.

Tournament design. In competitive settings — esports, academic competitions, hiring processes — you want to identify the best participants with as few rounds as possible. compere’s algorithms can design tournament brackets that are more informative than traditional round-robin or single-elimination formats, particularly when the number of participants is large.

Content moderation and policy. Platforms that need to rank the severity of content violations can use compere to train ranking models from expert judgments. Rather than having experts label every piece of content, compere identifies the comparisons that are most informative for calibrating the severity scale.

Practical Considerations

compere is designed to be used by people who are not experts in bandit algorithms. The interface is simple: provide items, provide a way to compare them (human evaluators, automated metrics, or a combination), specify what part of the ranking matters most, and compere handles the rest.
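
The overall shape of that workflow looks something like the following. To be clear, this is a hypothetical sketch, not compere's real API: the `Ranker` class, its methods, and the toy comparison callback are all invented here to make the loop concrete (the stand-in just counts wins with random pair selection, which compere's adaptive selection improves on).

```python
import random

random.seed(2)

class Ranker:
    """Trivial stand-in for a bandit-based ranker (hypothetical API)."""

    def __init__(self, items, top_k):
        self.items, self.top_k = list(items), top_k
        self.wins = {item: 0 for item in items}

    def next_pair(self):
        # A real implementation would pick the most informative pair.
        return random.sample(self.items, 2)

    def record(self, winner):
        self.wins[winner] += 1

    def ranking(self):
        return sorted(self.items, key=self.wins.get, reverse=True)

def human_judgment(a, b):
    # Stand-in evaluator: longer names "win", with 10% random noise.
    better = a if len(a) > len(b) else b
    worse = b if better == a else a
    return better if random.random() > 0.1 else worse

ranker = Ranker(["x", "yy", "zzz", "wwww"], top_k=2)
for _ in range(300):
    a, b = ranker.next_pair()
    ranker.record(human_judgment(a, b))

print(ranker.ranking()[:2])  # current top-2 estimate
```

The user-facing loop is the important part: request a pair, supply a judgment, repeat, and read off the ranking whenever it is good enough to stop.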

The computational overhead of compere itself is negligible — the bottleneck is always the comparisons, not the algorithm’s decision-making. compere can manage rankings of up to 100,000 items on modest hardware, though the number of comparisons required grows with the number of items, the desired precision, and the noise level.

One limitation worth noting: compere assumes that item quality is static during the ranking process. If items change while they are being evaluated — a product is updated, a candidate gains new experience, a search algorithm is retrained — the ranking may be inconsistent. For dynamic settings, we are exploring extensions that discount older comparisons, though these are not yet in the production release.

Why This Matters

Ranking is one of those problems that seems simple until you try to do it well at scale. The combination of noise, cost, and the need for adaptive precision makes it a poor fit for classical sorting and a natural fit for sequential decision-making under uncertainty.

compere brings principled bandit algorithms to this problem in a form that is practical and accessible. The result is rankings that are better (because comparisons are allocated where they matter), cheaper (because fewer comparisons are needed), and more honest (because they come with calibrated uncertainty). For any organisation that spends significant resources on evaluation and ranking — and that is most organisations — this represents a meaningful improvement in how that budget is spent.