Balance Works
Evidence Explainer

Relative Likelihood Ratios &
Simpson's Paradox

Understanding how statistics reveal — and sometimes conceal — inequality in workplace and policing data.

01 — Foundation

Likelihood Ratios vs Relative Likelihood Ratios

These two terms are often used interchangeably — but they mean different things. Getting the distinction right matters, particularly when presenting evidence in professional or legal contexts.

Likelihood Ratio (LR)

A Likelihood Ratio expresses the probability of an outcome for a given group — specifically, the ratio of outcomes to the population at risk. It is the group's rate.

LR_A = Outcomes for Group A ÷ Population of Group A
e.g. 38 stops ÷ 1,000 Black people = LR of 0.038

Each group has its own LR. On its own, it tells you how common the outcome is within that group — but not how that compares to anyone else.

Relative Likelihood Ratio (RLR)

The Relative Likelihood Ratio is the comparison between two groups' likelihood ratios. It answers the question: how many times more likely is this outcome for Group A than for the reference group?

RLR = LR_A (focus group) ÷ LR_B (reference group) = (Outcomes_A ÷ Population_A) ÷ (Outcomes_B ÷ Population_B)
e.g. 0.038 ÷ 0.007 = RLR of 5.4 — a Black person is 5.4× more likely to be stopped than a White person

The reference group (Group B) is typically the majority or comparator group — e.g. White British, Male, Non-disabled. The RLR is the number that appears in equality reports, tribunal evidence, and PSED monitoring data.

💡
In short: LRs are the inputs; the RLR is the output. You calculate a likelihood ratio for each group, then divide one by the other to get the relative likelihood ratio. The RLR is the figure that quantifies disparity.
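The two-step calculation above can be sketched in a few lines of Python, using the illustrative stop-and-search figures from the formulas above (the function names here are ours, for illustration only):

```python
def likelihood_ratio(outcomes, population):
    """Rate of the outcome within a single group (the group's LR)."""
    return outcomes / population

def relative_likelihood_ratio(outcomes_a, pop_a, outcomes_b, pop_b):
    """How many times more likely the outcome is for the focus group (A)
    than for the reference group (B)."""
    return likelihood_ratio(outcomes_a, pop_a) / likelihood_ratio(outcomes_b, pop_b)

# Illustrative figures from above: 38 stops per 1,000 Black people (focus
# group) vs 7 stops per 1,000 White people (reference group).
lr_black = likelihood_ratio(38, 1000)   # 0.038
lr_white = likelihood_ratio(7, 1000)    # 0.007
rlr = relative_likelihood_ratio(38, 1000, 7, 1000)
print(f"LR (Black) = {lr_black:.3f}, LR (White) = {lr_white:.3f}, RLR = {rlr:.1f}")
```

The LRs are the intermediate inputs; the single RLR figure (here 5.4) is what would appear in a report.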

Reading the RLR — Direction Matters

An RLR can fall above or below 1.0, and whether that is a problem depends entirely on the outcome being measured. For a negative outcome (disciplinary, use of force, dismissal), a high RLR signals the focus group is disadvantaged. For a positive outcome (promotion, training access, commendations), a low RLR signals the same thing. Always ask: which direction represents harm for this group?

For NEGATIVE outcomes (e.g. stop & search, disciplinary, use of force)
RLR Value | What It Means | Concern Level
< 0.80 | Focus group substantially less likely — worth checking; may reflect under-recording or genuine under-use | Note
0.80 – 1.25 | Within the four-fifths rule tolerance — broadly proportionate (see note below) | Acceptable range
1.25 – 1.5 | Exceeds four-fifths threshold — adverse impact identified, warrants investigation | Investigate
1.5 – 2.0 | 50–100% more likely — moderate to significant disparity | Significant
2.0 – 3.0 | 2–3× more likely — serious disparity, action required | Serious
3.0+ | 3× or more likely — severe disparity, urgent review needed | Critical
For POSITIVE outcomes (e.g. promotion, training access, appointment)
RLR Value | What It Means | Concern Level
> 1.25 | Focus group more likely to access the positive outcome — may indicate over-representation worth monitoring | Note
0.80 – 1.25 | Within the four-fifths rule tolerance — broadly proportionate | Acceptable range
0.50 – 0.80 | Exceeds four-fifths threshold — focus group substantially less likely to benefit; adverse impact identified | Significant
< 0.50 | Focus group less than half as likely to access the positive outcome — serious barrier indicated | Serious

⚖️ The Four-Fifths Rule (0.80 / 1.25)

The most widely used rubric in UK workforce equality monitoring is the four-fifths rule, formalised in the NHS Workforce Race Equality Standard (WRES). It defines adverse impact as any RLR below 0.80 or above 1.25. Note that 1.25 = 1 ÷ 0.80 — the thresholds are reciprocals, making them symmetrical around 1.0.

These are rules of thumb, not legal thresholds. An RLR of 1.24 is not automatically safe, and an RLR of 1.26 is not automatically unlawful. What the rubric does is provide a consistent, defensible trigger for further investigation — and a common language across organisations.

Different sectors use slightly different thresholds (some policing contexts use 2.0 as the trigger for formal scrutiny), so always clarify which rubric applies in your context.
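A minimal sketch of the four-fifths trigger as described above, with the 0.80 and 1.25 thresholds hard-coded (the function name is an illustration, not a standard API):

```python
LOWER, UPPER = 0.80, 1.25   # four-fifths thresholds; note 1.25 == 1 / 0.80

def four_fifths_flag(rlr):
    """Return True when an RLR falls outside the 0.80-1.25 tolerance band,
    i.e. when adverse impact is indicated and investigation is warranted."""
    return rlr < LOWER or rlr > UPPER

# The band is symmetric in ratio terms: an RLR of 0.75 and its reciprocal
# (1 / 0.75 ≈ 1.33) are both flagged; 1.0 and 1.24 are not.
print(four_fifths_flag(0.75), four_fifths_flag(1 / 0.75))  # True True
print(four_fifths_flag(1.0), four_fifths_flag(1.24))       # False False
```

As the note above stresses, this is a trigger for further investigation, not a legal threshold; adjust the constants where a sector uses a different rubric.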

📊 Statistical Significance: Don't Overlook Small Numbers

A large RLR can be misleading when the underlying numbers are small. If a group has only 10 members and 3 face a particular outcome, the RLR may appear alarming — but with such a small sample, the result is highly volatile and unreliable. A different year, or a single additional case, could change the figure dramatically.

This is where chi-square (χ²) analysis is valuable. A chi-square test assesses whether the difference between observed and expected frequencies across groups is statistically significant — i.e. whether it is likely to reflect a real pattern rather than random variation.

Best practice is to report both the RLR and the statistical significance: an RLR of 2.0 that is not statistically significant (perhaps because n=15 in one cell) tells a very different story to an RLR of 1.4 that is highly significant across 500 cases. Chi-square is accessible in practice — it can be calculated directly in Excel using CHISQ.TEST on a simple 2×2 table of outcomes vs non-outcomes for each group. Some statistical software will also produce confidence intervals around the RLR, which communicate the same uncertainty more explicitly if needed.
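As a sketch in pure Python (no statistics library needed): a 2×2 chi-square test of independence on the illustrative stop-and-search figures from Part 01. With one degree of freedom the statistic is the square of a standard normal, so the p-value can be taken from math.erfc.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square test of independence on a 2x2 table:
        [[a, b],    e.g. [[outcomes_A, non_outcomes_A],
         [c, d]]          [outcomes_B, non_outcomes_B]]
    Returns (chi2, p). With 1 degree of freedom, chi2 is the square of a
    standard normal, so p = erfc(sqrt(chi2 / 2))."""
    n = a + b + c + d
    chi2 = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        expected = row * col / n
        chi2 += (obs - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Illustrative figures from Part 01: 38 stops out of 1,000 vs 7 out of 1,000.
chi2, p = chi_square_2x2(38, 962, 7, 993)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")  # here the RLR of 5.4 is also highly significant
```

Running the same function on a table with tiny cell counts illustrates the opposite case: a large RLR whose p-value offers no confidence that the pattern is real.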

💡
The practitioner's test: An RLR opens an investigation, it doesn't conclude one. The question is always whether a disparity — in either direction — can be explained by a lawful, proportionate factor. If it cannot, the burden of justification increases with the size and significance of the ratio.
02 — Applied

Equality Examples

Relative Likelihood Ratios appear across the full spectrum of equality data — from policing outcomes to workplace progression. Here are three common contexts.

Misconduct & Disciplinary Proceedings

In organisations, Relative Likelihood Ratios can reveal whether employees from certain groups face disciplinary action at disproportionate rates — even when controlling for performance or seniority.

Illustrative example: disciplinary rates in a 500-person organisation
Group | Employee Count | Disciplinaries | Rate | RLR
White | 320 | 16 | 5.0% | 1.0 (reference LR)
Asian | 80 | 6 | 7.5% | 1.5×
Black | 60 | 7 | 11.7% | 2.3×
Mixed / Other | 40 | 4 | 10.0% | 2.0×
⚖️
A Black employee is 2.3× more likely to face disciplinary action than a White employee in this example — an RLR of 2.3. Under the PSED (Public Sector Equality Duty), public bodies must publish and act on this type of data. An RLR above 2.0 in a workforce context is a significant red flag.

Promotion & Progression

RLRs can be applied in reverse — when the outcome is positive (e.g. promotion), an RLR below 1.0 for a group means they are less likely to progress, revealing potential barriers.

Promotion success rates: 200 applicants over 3 years
Group | Applied | Promoted | Success Rate | RLR
Male (non-disabled) | 80 | 32 | 40% | 1.0 (reference LR)
Female (non-disabled) | 70 | 21 | 30% | 0.75×
Male (disabled) | 30 | 9 | 30% | 0.75×
Female (disabled) | 20 | 4 | 20% | 0.50×
🔍
A disabled woman has an RLR of 0.50 — she is only half as likely to be promoted as a non-disabled man. This is an intersectional finding — neither gender nor disability alone fully explains the disparity. RLRs applied intersectionally are a powerful tool for revealing compounding disadvantage.
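The intersectional RLRs in the table can be reproduced mechanically. A sketch using the illustrative figures above, with non-disabled men as the reference group:

```python
# (applied, promoted) per group, from the illustrative promotion table above
groups = {
    "Male (non-disabled)":   (80, 32),   # reference group
    "Female (non-disabled)": (70, 21),
    "Male (disabled)":       (30, 9),
    "Female (disabled)":     (20, 4),
}

ref_applied, ref_promoted = groups["Male (non-disabled)"]
ref_rate = ref_promoted / ref_applied   # 40% success rate

for name, (applied, promoted) in groups.items():
    rate = promoted / applied
    rlr = rate / ref_rate
    print(f"{name}: success rate {rate:.0%}, RLR {rlr:.2f}")
```

Note that the single-axis RLRs (0.75 for gender alone, 0.75 for disability alone) understate the 0.50 faced at the intersection, which is why the combination matters.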
03 — Interactive

Relative Likelihood Ratio Calculator

Enter the outcome counts and population sizes for two groups. The calculator derives each group's Likelihood Ratio, then divides them to produce the Relative Likelihood Ratio.

Example result — Group A (Focus Group) vs Group B (Reference Group):
RLR = 5.4× — a Black person is 5.4× more likely to experience this outcome than a White person.
04 — Advanced

Simpson's Paradox

A statistical phenomenon where a trend that appears in aggregated data reverses — or disappears — when the data is broken down into subgroups. In equality work, it can both hide and reveal discrimination.

The Classic Problem

Imagine an organisation where, in aggregate, women appear to be promoted at a much lower rate than men. Does this mean women are being treated unfairly within the organisation's promotion processes? Not necessarily — and the answer matters, because it determines the right intervention.

If women are concentrated in departments with lower overall promotion rates, and men are concentrated in departments with higher overall promotion rates, the aggregate can make the overall picture look worse than what is happening in any individual department — even if women are actually favoured within every department.

Worked Example: Promotion in a Two-Department Organisation
Department | Gender | Applied | Promoted | Rate | RLR (W vs M)
Operations (lower promo rate) | Men | 10 | 1 | 10% |
Operations (lower promo rate) | Women | 40 | 6 | 15% | 1.50 ✓ women favoured
Strategy (higher promo rate) | Men | 40 | 20 | 50% |
Strategy (higher promo rate) | Women | 10 | 6 | 60% | 1.20 ✓ women favoured
All combined | Men | 50 | 21 | 42% |
All combined | Women | 50 | 12 | 24% | 0.57 ✗ women appear disadvantaged
🔄
Within each department, women are promoted at a higher rate than men (RLR 1.50 and 1.20). Yet in the aggregate, women appear to be promoted at less than half the rate of men (RLR 0.57). The paradox arises because most women are in Operations — where promotion is rare for everyone — while most men are in Strategy, where promotion is common. The problem is not how women are treated within departments; it is how they are distributed across them. That is a structural barrier, not a departmental one — and it calls for a different intervention.
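The reversal is easy to verify mechanically. A sketch that recomputes the within-department and aggregate RLRs (women vs men) from the worked example:

```python
# (applied, promoted) by (department, gender), from the worked example above
data = {
    ("Operations", "Men"):   (10, 1),
    ("Operations", "Women"): (40, 6),
    ("Strategy",   "Men"):   (40, 20),
    ("Strategy",   "Women"): (10, 6),
}

def rlr(dept=None):
    """RLR for women vs men, within one department or (dept=None) in aggregate."""
    rates = {}
    for gender in ("Men", "Women"):
        applied = sum(a for (d, g), (a, p) in data.items()
                      if g == gender and dept in (None, d))
        promoted = sum(p for (d, g), (a, p) in data.items()
                       if g == gender and dept in (None, d))
        rates[gender] = promoted / applied
    return rates["Women"] / rates["Men"]

print(f"Operations: {rlr('Operations'):.2f}")  # 1.50 — women favoured
print(f"Strategy:   {rlr('Strategy'):.2f}")    # 1.20 — women favoured
print(f"Aggregate:  {rlr():.2f}")              # 0.57 — women appear disadvantaged
```

The same four rows produce opposite conclusions depending only on whether the department column is kept or collapsed, which is the paradox in miniature.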

Interactive Demonstration: Gender Pay Gap in Policing

In policing, officers are generally paid more than police staff — and officers are more likely to be men, while staff are more likely to be women. Toggle between views to see how this shapes the headline pay gap figure.

Overall median hourly pay — all officers and staff combined (illustrative)
Men: £19.00/hr
Women: £14.50/hr
Gender Pay Gap: 23.7%
👁️
The aggregate shows a 23.7% gender pay gap. At face value it appears women are paid substantially less. But does this reflect pay discrimination within roles — or something about the composition of the workforce? Only disaggregating by officers and staff can answer that.

📅 A Real-World Twist: Year-on-Year, Everything Got Worse — Yet the Headline Improved

The officers/staff dynamic creates a further paradox when tracked over time. In the following illustrative example — based on a real pattern observed in force-level GPG reporting — the pay gap worsened within both officers and staff year-on-year, yet the aggregate headline figure improved.

Year 1 — Baseline
Group | Headcount | Men avg pay | Women avg pay | GPG
Officers | 300 (75% M, 25% F) | £22.00 | £21.00 | 4.5%
Staff | 200 (30% M, 70% F) | £14.00 | £13.00 | 7.1%
All combined | 500 | £20.32 | £15.79 | 22.3%
Year 2 — 30 new female officers recruited at entry-level pay (£19.00)
Group | Headcount | Men avg pay | Women avg pay | GPG
Officers | 330 (68% M, 32% F) | £22.50 | £20.79 ↓ | 7.6% ↑ worse
Staff | 200 (30% M, 70% F) | £14.50 | £13.20 | 9.0% ↑ worse
All combined | 530 | £20.82 | £16.45 | 21.0% ↓ better

How Can This Be True?

The officer pay gap widened because 30 new women joined at entry-level officer pay (£19.00), pulling the average women's officer pay down — even though each individual woman is paid in line with her male equivalent at the same point on the scale. The staff gap widened for separate reasons. Yet the aggregate gap narrowed, because those 30 new female officers — despite being at the bottom of the officer pay scale — still earn more than most staff. Their arrival shifted the overall composition of the female workforce upward, improving the aggregate figure while masking deterioration in both underlying groups.

A headline improvement can therefore coexist with — and actively conceal — worsening conditions in every subgroup. This is precisely why aggregate GPG reporting, mandated under the Equality Act 2010, tells only part of the story.
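The arithmetic can be checked directly. In this sketch the headcounts are inferred from the tables' percentages (75% of 300 officers gives 225 men, and so on) and the 30 new female officers join at £19.00 with existing pay moving as shown; these are illustrative assumptions, not real figures:

```python
# Each group is (men_count, men_avg_pay, women_count, women_avg_pay).
def gpg(groups):
    """Aggregate gender pay gap across groups, as a fraction of men's mean pay."""
    men_total = sum(mc * mp for mc, mp, wc, wp in groups)
    men_n = sum(mc for mc, mp, wc, wp in groups)
    women_total = sum(wc * wp for mc, mp, wc, wp in groups)
    women_n = sum(wc for mc, mp, wc, wp in groups)
    men_avg, women_avg = men_total / men_n, women_total / women_n
    return (men_avg - women_avg) / men_avg

year1 = [(225, 22.00, 75, 21.00),    # officers: 225 men, 75 women
         (60, 14.00, 140, 13.00)]    # staff: 60 men, 140 women

# Year 2: 30 new female officers at £19.00 pull the women's officer average
# down to ~£20.79, even though every existing individual's pay rose.
year2 = [(225, 22.50, 105, (75 * 21.50 + 30 * 19.00) / 105),
         (60, 14.50, 140, 13.20)]

print(f"Year 1 aggregate GPG: {gpg(year1):.1%}")  # ~22.3%
print(f"Year 2 aggregate GPG: {gpg(year2):.1%}")  # ~21.0%, 'better' despite both subgroups worsening
```

Passing a single group to the same function confirms the within-group gaps widened (officers from 4.5% to 7.6%, staff from 7.1% to 9.0%) even as the aggregate narrowed.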

📌
The diagnostic question: Is the aggregate figure improving because conditions are genuinely improving — or because the composition of the workforce is changing? Only disaggregation can tell you which.

Simpson's Paradox in Policing: Use of Force

The same paradox can appear in use-of-force data. The key is understanding not just the rates, but the volumes behind them — specifically, which groups are concentrated in which encounter types.

In this illustrative example, vehicle stops involve higher force rates for everyone, and Black people are concentrated in foot patrol; White people are concentrated in vehicle stops.

Context | Group | Encounters | Force used | Rate (LR) | RLR
Foot patrol (lower force rate type) | White | 200 | 10 | 5.0% |
Foot patrol (lower force rate type) | Black | 800 | 72 | 9.0% | 1.8× Black higher
Vehicle stops (higher force rate type) | White | 800 | 120 | 15.0% |
Vehicle stops (higher force rate type) | Black | 200 | 40 | 20.0% | 1.33× Black higher
All combined | White | 1,000 | 130 | 13.0% |
All combined | Black | 1,000 | 112 | 11.2% | 0.86× Black appears LOWER
🔍
Why does this happen? Black people have 800 of their 1,000 encounters in foot patrol — the lower-force type — while White people have 800 of their 1,000 encounters in vehicle stops, where force is used more often by everyone. Black people experience more force within each encounter type (RLRs of 1.8× and 1.33×), but their concentration in the lower-force context pulls their aggregate rate below the White rate. The aggregate RLR of 0.86 — apparently reassuring — actively conceals a consistent pattern of higher force use against Black people in every context.
⚠️
Aggregate RLRs can actively mislead. Always show the volumes. A rate without a denominator — and without the distribution of that denominator across subgroups — is an incomplete picture.
05 — Summary

Key Takeaways

What practitioners need to know when working with relative likelihood ratios and aggregated equality data.

✅ On Relative Likelihood Ratios

An RLR quantifies disparity — it is the ratio of two groups' Likelihood Ratios, telling you how many times more (or less) likely an outcome is for one group compared to a reference group.

It is a starting point, not a verdict. An RLR of 3.0 demands explanation. It may be lawfully justified as a proportionate means of achieving a legitimate aim — but the burden of demonstrating this increases with the size of the ratio.

Intersectionality matters. Run RLRs for combinations of protected characteristics, not just single-axis comparisons. The greatest disparities often appear at the intersections.

RLRs below 1.0 matter too. When the outcome is positive (promotion, development, pay), an RLR below 1.0 signals under-representation, not safety.

⚠️ On Simpson's Paradox

Never trust aggregate-only data in equality analysis. Always ask: what confounding variable might be lurking in the structure of the data?

Common confounders in equality data include: job level/grade, length of service, department, geographic location, shift pattern, and contract type.

The paradox can work both ways. It can make discrimination invisible (the GPG example), or it can make a neutral system appear discriminatory. Both distortions are dangerous.

Causation vs. composition. Always ask: is this gap because of how people are treated within groups, or because of how groups are distributed across categories? The answer determines the intervention.

The Practitioner's Rule

Report both the aggregate and the disaggregated figures. When they tell different stories, that difference is itself the finding — and often the most important one.

Figures used in this explainer are illustrative and based on published data patterns. They are intended to demonstrate analytical concepts, not to represent exact current statistics.