Balance Works
Evidence Explainer

Relative Likelihood Ratios &
Simpson's Paradox

Understanding how statistics reveal — and sometimes conceal — inequality in workplace and policing data.

01 — Foundation

Likelihood Ratios vs Relative Likelihood Ratios

These two terms are often used interchangeably — but they mean different things. Getting the distinction right matters, particularly when presenting evidence in professional or legal contexts.

Likelihood Ratio (LR)

A Likelihood Ratio expresses the probability of an outcome for a given group — specifically, the ratio of outcomes to the population at risk. It is the group's rate.

LR_A = Outcomes for Group A ÷ Population of Group A
e.g. 38 stops ÷ 1,000 Black people = LR of 0.038

Each group has its own LR. On its own, it tells you how common the outcome is within that group — but not how that compares to anyone else.

Relative Likelihood Ratio (RLR)

The Relative Likelihood Ratio is the comparison between two groups' likelihood ratios. It answers the question: how many times more likely is this outcome for Group A than for the reference group?

RLR = LR_A (focus group) ÷ LR_B (reference group) = (Outcomes_A ÷ Population_A) ÷ (Outcomes_B ÷ Population_B)
e.g. 0.038 ÷ 0.007 = RLR of 5.4 — a Black person is 5.4× more likely to be stopped than a White person

The reference group (Group B) is typically the majority or comparator group — e.g. White British, Male, Non-disabled. The RLR is the number that appears in equality reports, tribunal evidence, and PSED monitoring data.

💡
In short: LRs are the inputs; the RLR is the output. You calculate a likelihood ratio for each group, then divide one by the other to get the relative likelihood ratio. The RLR is the figure that quantifies disparity.
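The two-step calculation above can be sketched in a few lines of Python, using the illustrative stop-and-search figures from the formulas above (the function names here are ours, for illustration only):

```python
def likelihood_ratio(outcomes, population):
    """Rate of the outcome within a single group (the group's LR)."""
    return outcomes / population

def relative_likelihood_ratio(outcomes_a, pop_a, outcomes_b, pop_b):
    """How many times more likely the outcome is for the focus group (A)
    than for the reference group (B)."""
    return likelihood_ratio(outcomes_a, pop_a) / likelihood_ratio(outcomes_b, pop_b)

# Illustrative figures from above: 38 stops per 1,000 Black people (focus
# group) vs 7 stops per 1,000 White people (reference group).
lr_black = likelihood_ratio(38, 1000)   # 0.038
lr_white = likelihood_ratio(7, 1000)    # 0.007
rlr = relative_likelihood_ratio(38, 1000, 7, 1000)
print(f"LR (Black) = {lr_black:.3f}, LR (White) = {lr_white:.3f}, RLR = {rlr:.1f}")
```

The LRs are the intermediate inputs; the single RLR figure (here 5.4) is what would appear in a report.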

Reading the RLR — Direction Matters

An RLR can fall above or below 1.0, and whether that is a problem depends entirely on the outcome being measured. For a negative outcome (disciplinary, use of force, dismissal), a high RLR signals the focus group is disadvantaged. For a positive outcome (promotion, training access, commendations), a low RLR signals the same thing. Always ask: which direction represents harm for this group?

For NEGATIVE outcomes (e.g. stop & search, disciplinary, use of force)
RLR Value | What It Means | Concern Level
< 0.80 | Focus group substantially less likely — worth checking; may reflect under-recording or genuine under-use | Note
0.80 – 1.25 | Within the four-fifths rule tolerance — broadly proportionate (see note below) | Acceptable range
1.25 – 1.5 | Exceeds four-fifths threshold — adverse impact identified, warrants investigation | Investigate
1.5 – 2.0 | 50–100% more likely — moderate to significant disparity | Significant
2.0 – 3.0 | 2–3× more likely — serious disparity, action required | Serious
3.0+ | 3× or more likely — severe disparity, urgent review needed | Critical
For POSITIVE outcomes (e.g. promotion, training access, appointment)
RLR Value | What It Means | Concern Level
> 1.25 | Focus group more likely to access the positive outcome — may indicate over-representation worth monitoring | Note
0.80 – 1.25 | Within the four-fifths rule tolerance — broadly proportionate | Acceptable range
0.50 – 0.80 | Exceeds four-fifths threshold — focus group substantially less likely to benefit; adverse impact identified | Significant
< 0.50 | Focus group less than half as likely to access the positive outcome — serious barrier indicated | Serious

⚖️ The Four-Fifths Rule (0.80 / 1.25)

The most widely used rubric in UK workforce equality monitoring is the four-fifths rule, formalised in the NHS Workforce Race Equality Standard (WRES). It defines adverse impact as any RLR below 0.80 or above 1.25. Note that 1.25 = 1 ÷ 0.80 — the thresholds are reciprocals, making them symmetrical around 1.0.

These are rules of thumb, not legal thresholds. An RLR of 1.24 is not automatically safe, and an RLR of 1.26 is not automatically unlawful. What the rubric does is provide a consistent, defensible trigger for further investigation — and a common language across organisations.

Different sectors use slightly different thresholds (some policing contexts use 2.0 as the trigger for formal scrutiny), so always clarify which rubric applies in your context.
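A minimal sketch of the four-fifths trigger as described above, with the 0.80 and 1.25 thresholds hard-coded (the function name is an illustration, not a standard API):

```python
LOWER, UPPER = 0.80, 1.25   # four-fifths thresholds; note 1.25 == 1 / 0.80

def four_fifths_flag(rlr):
    """Return True when an RLR falls outside the 0.80-1.25 tolerance band,
    i.e. when adverse impact is indicated and investigation is warranted."""
    return rlr < LOWER or rlr > UPPER

# The band is symmetric in ratio terms: an RLR of 0.75 and its reciprocal
# (1 / 0.75 ≈ 1.33) are both flagged; 1.0 and 1.24 are not.
print(four_fifths_flag(0.75), four_fifths_flag(1 / 0.75))  # True True
print(four_fifths_flag(1.0), four_fifths_flag(1.24))       # False False
```

As the note above stresses, this is a trigger for further investigation, not a legal threshold; adjust the constants where a sector uses a different rubric.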

📊 Statistical Significance: Don't Overlook Small Numbers

A large RLR can be misleading when the underlying numbers are small. If a group has only 10 members and 3 face a particular outcome, the RLR may appear alarming — but with such a small sample, the result is highly volatile and unreliable. A different year, or a single additional case, could change the figure dramatically.

This is where chi-square (χ²) analysis is valuable. A chi-square test assesses whether the difference between observed and expected frequencies across groups is statistically significant — i.e. whether it is likely to reflect a real pattern rather than random variation.

Best practice is to report both the RLR and the statistical significance: an RLR of 2.0 that is not statistically significant (perhaps because n=15 in one cell) tells a very different story to an RLR of 1.4 that is highly significant across 500 cases. Chi-square is accessible in practice — it can be calculated directly in Excel using CHISQ.TEST on a simple 2×2 table of outcomes vs non-outcomes for each group. Some statistical software will also produce confidence intervals around the RLR, which communicate the same uncertainty more explicitly if needed.
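As a sketch in pure Python (no statistics library needed): a 2×2 chi-square test of independence on the illustrative stop-and-search figures from Part 01. With one degree of freedom the statistic is the square of a standard normal, so the p-value can be taken from math.erfc.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square test of independence on a 2x2 table:
        [[a, b],    e.g. [[outcomes_A, non_outcomes_A],
         [c, d]]          [outcomes_B, non_outcomes_B]]
    Returns (chi2, p). With 1 degree of freedom, chi2 is the square of a
    standard normal, so p = erfc(sqrt(chi2 / 2))."""
    n = a + b + c + d
    chi2 = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        expected = row * col / n
        chi2 += (obs - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Illustrative figures from Part 01: 38 stops out of 1,000 vs 7 out of 1,000.
chi2, p = chi_square_2x2(38, 962, 7, 993)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")  # here the RLR of 5.4 is also highly significant
```

Running the same function on a table with tiny cell counts illustrates the opposite case: a large RLR whose p-value offers no confidence that the pattern is real.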

💡
The practitioner's test: An RLR opens an investigation, it doesn't conclude one. The question is always whether a disparity — in either direction — can be explained by a lawful, proportionate factor. If it cannot, the burden of justification increases with the size and significance of the ratio.
02 — Applied

Equality Examples

Relative Likelihood Ratios appear across the full spectrum of equality data — from policing outcomes to workplace progression. Here are three common contexts.

Misconduct & Disciplinary Proceedings

In organisations, Relative Likelihood Ratios can reveal whether employees from certain groups face disciplinary action at disproportionate rates — even when controlling for performance or seniority.

Illustrative example: disciplinary rates in a 500-person organisation
Group | Employee Count | Disciplinaries | Rate | RLR
White | 320 | 16 | 5.0% | 1.0 (reference LR)
Asian | 80 | 6 | 7.5% | 1.5×
Black | 60 | 7 | 11.7% | 2.3×
Mixed / Other | 40 | 4 | 10.0% | 2.0×
⚖️
A Black employee is 2.3× more likely to face disciplinary action than a White employee in this example — an RLR of 2.3. Under the PSED (Public Sector Equality Duty), public bodies must publish and act on this type of data. An RLR above 2.0 in a workforce context is a significant red flag.

Promotion & Progression

RLRs can be applied in reverse — when the outcome is positive (e.g. promotion), an RLR below 1.0 for a group means they are less likely to progress, revealing potential barriers.

Promotion success rates: 200 applicants over 3 years
Group | Applied | Promoted | Success Rate | RLR
Male (non-disabled) | 80 | 32 | 40% | 1.0 (reference LR)
Female (non-disabled) | 70 | 21 | 30% | 0.75×
Male (disabled) | 30 | 9 | 30% | 0.75×
Female (disabled) | 20 | 4 | 20% | 0.50×
🔍
A disabled woman has an RLR of 0.50 — she is only half as likely to be promoted as a non-disabled man. This is an intersectional finding — neither gender nor disability alone fully explains the disparity. RLRs applied intersectionally are a powerful tool for revealing compounding disadvantage.
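The intersectional RLRs in the table can be reproduced mechanically. A sketch using the illustrative figures above, with non-disabled men as the reference group:

```python
# (applied, promoted) per group, from the illustrative promotion table above
groups = {
    "Male (non-disabled)":   (80, 32),   # reference group
    "Female (non-disabled)": (70, 21),
    "Male (disabled)":       (30, 9),
    "Female (disabled)":     (20, 4),
}

ref_applied, ref_promoted = groups["Male (non-disabled)"]
ref_rate = ref_promoted / ref_applied   # 40% success rate

for name, (applied, promoted) in groups.items():
    rate = promoted / applied
    rlr = rate / ref_rate
    print(f"{name}: success rate {rate:.0%}, RLR {rlr:.2f}")
```

Note that the single-axis RLRs (0.75 for gender alone, 0.75 for disability alone) understate the 0.50 faced at the intersection, which is why the combination matters.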
03 — Interactive

Relative Likelihood Ratio Calculator

Enter the outcome counts and population sizes for two groups. The calculator derives each group's Likelihood Ratio, then divides them to produce the Relative Likelihood Ratio.

Example result — Group A (Focus Group) vs Group B (Reference Group):
RLR = 5.4× — a Black person is 5.4× more likely to experience this outcome than a White person.
04 — Advanced

Simpson's Paradox

A statistical phenomenon where a trend that appears in aggregated data reverses — or disappears — when the data is broken down into subgroups. In equality work, it can both hide and reveal discrimination.

The Classic Problem

Imagine an organisation where, in aggregate, women appear to be promoted at a much lower rate than men. Does this mean women are being treated unfairly within the organisation's promotion processes? Not necessarily — and the answer matters, because it determines the right intervention.

If women are concentrated in departments with lower overall promotion rates, and men are concentrated in departments with higher overall promotion rates, the aggregate can make the overall picture look worse than what is happening in any individual department — even if women are actually favoured within every department.

Worked Example: Promotion in a Two-Department Organisation
Department | Gender | Applied | Promoted | Rate | RLR (W vs M)
Operations (lower promo rate) | Men | 10 | 1 | 10% |
Operations (lower promo rate) | Women | 40 | 6 | 15% | 1.50 ✓ women favoured
Strategy (higher promo rate) | Men | 40 | 20 | 50% |
Strategy (higher promo rate) | Women | 10 | 6 | 60% | 1.20 ✓ women favoured
All combined | Men | 50 | 21 | 42% |
All combined | Women | 50 | 12 | 24% | 0.57 ✗ women appear disadvantaged
🔄
Within each department, women are promoted at a higher rate than men (RLR 1.50 and 1.20). Yet in the aggregate, women appear to be promoted at less than half the rate of men (RLR 0.57). The paradox arises because most women are in Operations — where promotion is rare for everyone — while most men are in Strategy, where promotion is common. The problem is not how women are treated within departments; it is how they are distributed across them. That is a structural barrier, not a departmental one — and it calls for a different intervention.
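The reversal is easy to verify mechanically. A sketch that recomputes the within-department and aggregate RLRs (women vs men) from the worked example:

```python
# (applied, promoted) by (department, gender), from the worked example above
data = {
    ("Operations", "Men"):   (10, 1),
    ("Operations", "Women"): (40, 6),
    ("Strategy",   "Men"):   (40, 20),
    ("Strategy",   "Women"): (10, 6),
}

def rlr(dept=None):
    """RLR for women vs men, within one department or (dept=None) in aggregate."""
    rates = {}
    for gender in ("Men", "Women"):
        applied = sum(a for (d, g), (a, p) in data.items()
                      if g == gender and dept in (None, d))
        promoted = sum(p for (d, g), (a, p) in data.items()
                       if g == gender and dept in (None, d))
        rates[gender] = promoted / applied
    return rates["Women"] / rates["Men"]

print(f"Operations: {rlr('Operations'):.2f}")  # 1.50 — women favoured
print(f"Strategy:   {rlr('Strategy'):.2f}")    # 1.20 — women favoured
print(f"Aggregate:  {rlr():.2f}")              # 0.57 — women appear disadvantaged
```

The same four rows produce opposite conclusions depending only on whether the department column is kept or collapsed, which is the paradox in miniature.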

Interactive Demonstration: Gender Pay Gap in Policing

In policing, officers are generally paid more than police staff — and officers are more likely to be men, while staff are more likely to be women. Toggle between views to see how this shapes the headline pay gap figure.

Overall median hourly pay — all officers and staff combined (illustrative)
Men: £19.00/hr
Women: £14.50/hr
Gender Pay Gap: 23.7%
👁️
The aggregate shows a 23.7% gender pay gap. At face value it appears women are paid substantially less. But does this reflect pay discrimination within roles — or something about the composition of the workforce? Only disaggregating by officers and staff can answer that.

📅 A Real-World Twist: Year-on-Year, Everything Got Worse — Yet the Headline Improved

The officers/staff dynamic creates a further paradox when tracked over time. In the following illustrative example — based on a real pattern observed in force-level GPG reporting — the pay gap worsened within both officers and staff year-on-year, yet the aggregate headline figure improved.

Year 1 — Baseline
Group | Headcount | Men avg pay | Women avg pay | GPG
Officers | 300 (75% M, 25% F) | £22.00 | £21.00 | 4.5%
Staff | 200 (30% M, 70% F) | £14.00 | £13.00 | 7.1%
All combined | 500 | £20.32 | £15.79 | 22.3%
Year 2 — 30 new female officers recruited at entry-level pay (£19.00)
Group | Headcount | Men avg pay | Women avg pay | GPG
Officers | 330 (68% M, 32% F) | £22.50 | £20.79 ↓ | 7.6% ↑ worse
Staff | 200 (30% M, 70% F) | £14.50 | £13.20 | 9.0% ↑ worse
All combined | 530 | £20.82 | £16.45 | 21.0% ↓ better

How Can This Be True?

The officer pay gap widened because 30 new women joined at entry-level officer pay (£19.00), pulling the average women's officer pay down — even though each individual woman is paid in line with her male equivalent at the same point on the scale. The staff gap widened for separate reasons. Yet the aggregate gap narrowed, because those 30 new female officers — despite being at the bottom of the officer pay scale — still earn more than most staff. Their arrival shifted the overall composition of the female workforce upward, improving the aggregate figure while masking deterioration in both underlying groups.

A headline improvement can therefore coexist with — and actively conceal — worsening conditions in every subgroup. This is precisely why aggregate GPG reporting, mandated under the Equality Act 2010, tells only part of the story.
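The arithmetic can be checked directly. In this sketch the headcounts are inferred from the tables' percentages (75% of 300 officers gives 225 men, and so on) and the 30 new female officers join at £19.00 with existing pay moving as shown; these are illustrative assumptions, not real figures:

```python
# Each group is (men_count, men_avg_pay, women_count, women_avg_pay).
def gpg(groups):
    """Aggregate gender pay gap across groups, as a fraction of men's mean pay."""
    men_total = sum(mc * mp for mc, mp, wc, wp in groups)
    men_n = sum(mc for mc, mp, wc, wp in groups)
    women_total = sum(wc * wp for mc, mp, wc, wp in groups)
    women_n = sum(wc for mc, mp, wc, wp in groups)
    men_avg, women_avg = men_total / men_n, women_total / women_n
    return (men_avg - women_avg) / men_avg

year1 = [(225, 22.00, 75, 21.00),    # officers: 225 men, 75 women
         (60, 14.00, 140, 13.00)]    # staff: 60 men, 140 women

# Year 2: 30 new female officers at £19.00 pull the women's officer average
# down to ~£20.79, even though every existing individual's pay rose.
year2 = [(225, 22.50, 105, (75 * 21.50 + 30 * 19.00) / 105),
         (60, 14.50, 140, 13.20)]

print(f"Year 1 aggregate GPG: {gpg(year1):.1%}")  # ~22.3%
print(f"Year 2 aggregate GPG: {gpg(year2):.1%}")  # ~21.0%, 'better' despite both subgroups worsening
```

Passing a single group to the same function confirms the within-group gaps widened (officers from 4.5% to 7.6%, staff from 7.1% to 9.0%) even as the aggregate narrowed.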

📌
The diagnostic question: Is the aggregate figure improving because conditions are genuinely improving — or because the composition of the workforce is changing? Only disaggregation can tell you which.

Simpson's Paradox in Policing: Use of Force

The same paradox can appear in use-of-force data. The key is understanding not just the rates, but the volumes behind them — specifically, which groups are concentrated in which encounter types.

In this illustrative example, vehicle stops involve higher force rates for everyone, and Black people are concentrated in foot patrol; White people are concentrated in vehicle stops.

Context | Group | Encounters | Force used | Rate (LR) | RLR
Foot patrol (lower force rate type) | White | 200 | 10 | 5.0% |
Foot patrol (lower force rate type) | Black | 800 | 72 | 9.0% | 1.8× Black higher
Vehicle stops (higher force rate type) | White | 800 | 120 | 15.0% |
Vehicle stops (higher force rate type) | Black | 200 | 40 | 20.0% | 1.33× Black higher
All combined | White | 1,000 | 130 | 13.0% |
All combined | Black | 1,000 | 112 | 11.2% | 0.86× Black appears LOWER
🔍
Why does this happen? Black people have 800 of their 1,000 encounters in foot patrol — the lower-force type — while White people have 800 of their 1,000 encounters in vehicle stops, where force is used more often by everyone. Black people experience more force within each encounter type (RLRs of 1.8× and 1.33×), but their concentration in the lower-force context pulls their aggregate rate below the White rate. The aggregate RLR of 0.86 — apparently reassuring — actively conceals a consistent pattern of higher force use against Black people in every context.
⚠️
Aggregate RLRs can actively mislead. Always show the volumes. A rate without a denominator — and without the distribution of that denominator across subgroups — is an incomplete picture.
05 — Summary

Key Takeaways

What practitioners need to know when working with relative likelihood ratios and aggregated equality data.

✅ On Relative Likelihood Ratios

An RLR quantifies disparity — it is the ratio of two groups' Likelihood Ratios, telling you how many times more (or less) likely an outcome is for one group compared to a reference group.

It is a starting point, not a verdict. An RLR of 3.0 demands explanation. It may be lawfully justified as a proportionate means of achieving a legitimate aim — but the burden of demonstrating this increases with the size of the ratio.

Intersectionality matters. Run RLRs for combinations of protected characteristics, not just single-axis comparisons. The greatest disparities often appear at the intersections.

RLRs below 1.0 matter too. When the outcome is positive (promotion, development, pay), an RLR below 1.0 signals under-representation, not safety.

⚠️ On Simpson's Paradox

Never trust aggregate-only data in equality analysis. Always ask: what confounding variable might be lurking in the structure of the data?

Common confounders in equality data include: job level/grade, length of service, department, geographic location, shift pattern, and contract type.

The paradox can work both ways. It can make discrimination invisible (the GPG example), or it can make a neutral system appear discriminatory. Both distortions are dangerous.

Causation vs. composition. Always ask: is this gap because of how people are treated within groups, or because of how groups are distributed across categories? The answer determines the intervention.

The Practitioner's Rule

Report both the aggregate and the disaggregated figures. When they tell different stories, that difference is itself the finding — and often the most important one.

Figures used in this explainer are illustrative and based on published data patterns. They are intended to demonstrate analytical concepts, not to represent exact current statistics.