Nick Young

Analysis and visualizations - for the interesting and inane.

16 Jul 2025

Grade Expectations: Exploring A/B Testing & Feature Selection Through Student GPAs

Abstract

I’ve been seeing a lot of demand for A/B testing experience in the jobs I’ve been applying to lately. While I’ve done this type of work in academia, I can’t say I’ve formally performed it in a professional context. Occasionally, when I tell this to recruiters, I can sense their hesitation about advancing me to the next step. I’m kind of amazed when this is the limiting factor, because A/B testing in practice is not actually very complicated. Sure, a professional setting adds some nuance around the context that one needs to understand, but the mechanics of the technique itself are not complex. In this post, I’m going to describe exactly what A/B testing is, how to perform it, and how it can be used as a feature selection method, then springboard into more advanced feature selection methods. After giving a general definition of the A/B test, I will work through an example on synthetic student GPA data to better showcase its use.

What is A/B testing?

Contrary to the reductionist language I’m seeing online and in interviews, an A/B test is not a specific statistical test but a type of experimental design: we sample from a population and split the sample into a control group and an experimental group (groups A and B). Then we apply some change from the status quo to the experimental group and compare a metric between the two groups. Whether you know it or not, you’re probably already familiar with a few examples.

You know how in drug trials one group is given a placebo and another group is given a new drug to determine whether it has any efficacy? That is an A/B test. Another type of A/B test that I’m being asked about is product A/B testing. Take a generic social media company as an example. They have a vested interest in increasing the amount of time their users spend on the platform. They have a theory that the layout of their mobile application on a very specific phone is not optimized, and that they are losing user engagement on that phone as a result. To test whether they can increase engagement among users of this phone, they might run an A/B test featuring a new layout.

This is the high-level organization of the experimental design: form two groups (A and B), then apply the change whose efficacy you’re trying to research to one of those groups.

Important Preliminary Details

Before launching an A/B test, it’s essential to ensure several foundational conditions are met. These safeguards help ensure that the results are valid, interpretable, and actionable.

An appropriate sample size

Before you run an A/B test, it’s important to make sure your sample size is large enough. If your sample is too small, you might not be able to tell whether the difference you see between groups is real or just due to random chance.

Let’s take a coin flip as an example. If I flipped a quarter 3 times and got heads twice, I might think heads comes up more often than tails. But with such a small number of flips, that result could easily be a fluke. Now imagine I flipped the coin 1,000 times and got 520 heads — I’d feel much more confident saying the coin is probably fair (or close to it).
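The coin-flip intuition can be checked with an exact binomial test. This is a minimal sketch using base R's `binom.test`; the two scenarios are the ones described above and the variable names are my own. Neither result gives grounds to call the coin unfair at the 5% level, but the confidence interval for the probability of heads is far narrower with 1,000 flips than with 3.

```r
# Exact binomial tests of H0: P(heads) = 0.5 for the two scenarios above
small <- binom.test(x = 2, n = 3, p = 0.5)
large <- binom.test(x = 520, n = 1000, p = 0.5)

# Width of the 95% confidence interval for P(heads) in each case
small_width <- diff(small$conf.int)
large_width <- diff(large$conf.int)

cat("3 flips:    p-value =", round(small$p.value, 3),
    " CI width =", round(small_width, 3), "\n")
cat("1000 flips: p-value =", round(large$p.value, 3),
    " CI width =", round(large_width, 3), "\n")
```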

A/B tests work the same way. The more people or events included in the test, the more confident we can be in the results. The statistical test we use will output a number called a p-value, which tells us how likely it would be to see a difference at least as large as the one we observed if there were truly no difference between the groups. This number is heavily influenced by how much data we collect.

That’s why planning for sample size is so important. If we want to be very confident in our results, we’ll need a large enough sample. But in real-world settings, there’s often a trade-off: running a longer test gives us more certainty, while shorter tests might give faster answers (and quicker product changes) but with more uncertainty.
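One common way to plan a sample size is a power calculation. The sketch below uses base R's `power.t.test`; the minimum effect, the standard deviation, the 5% significance level and the 80% power are all assumed planning inputs chosen for illustration, not values from any real test.

```r
# Hypothetical planning inputs (assumptions, not taken from a real experiment)
min_effect <- 1     # smallest difference in means worth detecting
outcome_sd <- 11.6  # assumed standard deviation of the outcome

# Solve for the per-group sample size of a two-sample, two-sided t-test
plan <- power.t.test(delta = min_effect, sd = outcome_sd,
                     sig.level = 0.05, power = 0.80,
                     type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(plan$n)

# Doubling the effect we care about shrinks the required sample considerably
plan_big <- power.t.test(delta = 2 * min_effect, sd = outcome_sd,
                         sig.level = 0.05, power = 0.80)
c(small_effect = n_per_group, large_effect = ceiling(plan_big$n))
```

The trade-off described above shows up directly: the smaller the difference you want to be able to detect, the more data (and usually the more time) the test needs.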

Reasonably identical features along A/B

For an A/B test to be fair, the control group and the experiment group should be similar in every important way except for the thing you’re testing.

Imagine testing a new medication designed to help people with a specific illness. It wouldn’t make sense to include people without that illness in the experiment group but only people with the illness in the control group. That would make it impossible to know if differences in results were due to the medicine or just differences in who was in each group.

Similarly, if there’s another health condition that could affect how well the medicine works, you’d want to make sure that condition is either equally represented in both groups or excluded altogether. If not, the results could be misleading.

This idea applies to any A/B test: the groups should be as similar as possible so you can be confident that differences in outcomes are caused by the change you made — not other factors.

Random Sampling

To make sure the control and experiment groups are as similar as possible, we assign users to each group at random (strictly speaking, random assignment).

Random assignment means each user has an equal chance of being placed in either group. This helps avoid bias, so the groups don’t differ in any systematic way that could affect the results.

By randomly assigning users, we can be more confident that any differences we see are due to the change we’re testing, not other factors.
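In R, a random split can be as simple as shuffling group labels. This is a minimal sketch; the `users` data frame and the seed are hypothetical.

```r
set.seed(2025)  # arbitrary seed so the split is reproducible

users <- data.frame(user_id = 1:1000)  # hypothetical user list

# Shuffle an equal number of "A" and "B" labels across the users
users$group <- sample(rep(c("A", "B"), length.out = nrow(users)))

table(users$group)
```

Shuffling a fixed set of labels, rather than flipping a coin per user, guarantees equal group sizes while still giving every user the same chance of landing in either group.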

So if A/B is an experimental design, what’s the test?

So, we’ve organized equal-sized and similar-featured groups A and B to run a test… now what? Well, that all depends on what we’re trying to measure. Are we expecting a higher frequency of a binary outcome (a patient is cured / not cured), or higher values of a continuous variable (the changes to a study app helped yield higher SAT scores among users)? Does our continuous value follow an approximately normal distribution? The answers to all of these help us choose the type of test we will run.

The Student’s T-Test

A very common statistical test in an A/B experimental design is Student’s t-test (published by William Sealy Gosset, a statistician and brewery chemist, under the pseudonym “Student”). It compares the mean value of a variable of interest across our groups to determine whether there is a statistically significant difference. For example: is the average weight of a group that took a new weight loss pill different from that of its placebo control group, and is that difference meaningful or just random chance? Note that this is not the same question as “does this drug cause weight loss?” The t-test on its own does NOT establish causality; any causal claim has to rest on the experimental design (random assignment and comparable groups), not on the test.

There are a few different types of t-test depending on the setup (one sample, two sample, paired), but without going into these I’ll just say that two-sample tests are what A/B testing uses (two groups = two samples).

Z-test

Similar to Student’s t-test, but as typically used in A/B testing it compares the proportion of a binary outcome between groups. For example: does this product change yield more customer sign-ups than the status quo? Like the t-test, it comes in one-sample and two-sample flavors.
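As a sketch of the two-sample case, the example below runs base R's `prop.test` and then computes the same z statistic by hand. The sign-up counts are invented for illustration. `prop.test` reports a chi-squared statistic; with `correct = FALSE` it equals the square of the pooled two-proportion z statistic.

```r
# Hypothetical counts: sign-ups out of users shown each version
signups <- c(A = 120, B = 150)
shown   <- c(A = 1000, B = 1000)

# Two-sample test of equal proportions, without continuity correction
res <- prop.test(x = signups, n = shown, correct = FALSE)

# The same test by hand: pooled proportion, standard error, z statistic
p_hat  <- signups / shown
p_pool <- sum(signups) / sum(shown)
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / shown))
z      <- unname((p_hat["B"] - p_hat["A"]) / se)
p_value <- 2 * pnorm(-abs(z))  # two-sided p-value

c(z = z, z_squared = z^2, chi_squared = unname(res$statistic),
  p_by_hand = p_value, p_prop_test = res$p.value)
```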

Other Tests

These are the two primary tests used in A/B testing. There are myriad other statistical tests that fit particular scenarios. The Mann-Whitney U test is a non-parametric counterpart to the t-test that can be used when the assumption of normality is not met; it compares ranks rather than means. The chi-square test plays a similar role for categorical outcomes: it compares observed counts with the counts expected if the groups did not differ, and for two groups with a binary outcome it is equivalent to the two-proportion z-test (without continuity correction).
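Both alternatives are available in base R as `wilcox.test` (the Mann-Whitney U test goes by the name Wilcoxon rank-sum test there) and `chisq.test`. The sketch below runs each on made-up data: a skewed continuous outcome for the first and a 2 x 2 table of counts for the second.

```r
set.seed(11)  # arbitrary seed for reproducibility

# Hypothetical right-skewed outcome, e.g. session length in minutes
control   <- rexp(300, rate = 1 / 10)
treatment <- rexp(300, rate = 1 / 12)

# Mann-Whitney U (Wilcoxon rank-sum) test: compares the groups via ranks
mw <- wilcox.test(treatment, control)

# Chi-square test of independence on a 2 x 2 table of hypothetical counts
counts <- matrix(c(120, 880, 150, 850), nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("signed_up", "did_not")))
chi <- chisq.test(counts, correct = FALSE)

c(mann_whitney_p = mw$p.value, chi_square_p = chi$p.value)
```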

There are also tests that extend beyond the A/B structure. ANOVA (analysis of variance) can be used to detect whether the means differ significantly among any number of groups, not just two.
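A one-way ANOVA in base R looks like the sketch below. The three groups and their scores are simulated for illustration, with group C deliberately shifted upward.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical scores for three groups; group C is shifted upward on purpose
scores <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 100)),
  score = c(rnorm(100, mean = 70, sd = 10),
            rnorm(100, mean = 70, sd = 10),
            rnorm(100, mean = 75, sd = 10))
)

# One-way ANOVA: does mean score differ across the groups?
fit <- aov(score ~ group, data = scores)
anova_table <- summary(fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

anova_table
```

A significant result only says that at least one group mean differs; it does not say which one, so a follow-up comparison (for example `TukeyHSD(fit)`) is usually needed.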

In Practice

That was a bit more of a foreword than I probably wanted, but there was some exposition I needed to do for those unfamiliar. Now, I want to jump into a set of data I came across to show how we can use the common tests used in A/B experiment design for feature selection.

So, here we have a synthetic dataset representing the exam scores of 80,000 students, along with a set of potentially related predictor variables.

Here are the first six rows of the data to familiarize yourself with the features.

feature | row 1 | row 2 | row 3 | row 4 | row 5 | row 6
--- | --- | --- | --- | --- | --- | ---
student_id | 100000 | 100001 | 100002 | 100003 | 100004 | 100005
age | 26 | 28 | 17 | 27 | 25 | 17
gender | Male | Male | Male | Other | Female | Female
major | Computer Science | Arts | Arts | Psychology | Business | Engineering
study_hours_per_day | 7.64536653374835 | 5.7 | 2.4 | 3.4 | 4.7 | 4.2
social_media_hours | 3 | 0.5 | 4.2 | 4.6 | 0.8 | 0.5
netflix_hours | 0.1 | 0.4 | 0.7 | 2.3 | 2.7 | 2.5
part_time_job | Yes | No | No | Yes | Yes | No
attendance_percentage | 70.3 | 88.4 | 82.1 | 79.3 | 62.9 | 49.2
sleep_hours | 6.2 | 7.2 | 9.2 | 4.2 | 6.5 | 9.2
diet_quality | Poor | Good | Good | Fair | Good | Good
exercise_frequency | 3 | 4 | 4 | 3 | 6 | 3
parental_education_level | Some College | PhD | High School | Master | PhD | PhD
internet_quality | High | Low | Low | Medium | Low | High
mental_health_rating | 6 | 6.8 | 5.7 | 8.5 | 9.2 | 9.2
extracurricular_participation | Yes | No | Yes | Yes | No | Yes
previous_gpa | 4 | 4 | 3.79 | 4 | 4 | 4
semester | 5 | 7 | 4 | 6 | 4 | 3
stress_level | 5.8 | 5.8 | 8 | 4.6 | 5.7 | 1.5
dropout_risk | No | No | No | No | No | No
social_activity | 0 | 5 | 5 | 3 | 2 | 2
screen_time | 10.9 | 8.3 | 8 | 11.7 | 9.4 | 8.9
study_environment | Co-Learning Group | Co-Learning Group | Library | Co-Learning Group | Quiet Room | Quiet Room
access_to_tutoring | Yes | Yes | Yes | Yes | Yes | No
family_income_range | High | Low | High | Low | Medium | High
parental_support_level | 9 | 7 | 3 | 5 | 9 | 1
motivation_level | 7 | 2 | 9 | 3 | 1 | 10
exam_anxiety_score | 8 | 10 | 6 | 10 | 10 | 5
learning_style | Reading | Reading | Kinesthetic | Reading | Reading | Visual
time_management_score | 3 | 6 | 7.6 | 3.2 | 7.1 | 3.9
exam_score | 100 | 99 | 98 | 100 | 98 | 94

And some descriptive summary statistics to understand the shape of the data a bit more.

Data Frame Summary

grades_and_features

Dimensions: 80000 x 31
Duplicates: 0

Variable | Stats / Values | Freqs (% of Valid) | Missing
--- | --- | --- | ---
student_id [numeric] | Mean (sd): 139999.5 (23094.2); min ≤ med ≤ max: 1e+05 ≤ 139999.5 ≤ 179999; IQR (CV): 39999.5 (0.2) | 80000 distinct values | 0 (0.0%)
age [numeric] | Mean (sd): 22 (3.7); min ≤ med ≤ max: 16 ≤ 22 ≤ 28; IQR (CV): 6 (0.2) | 13 distinct values | 0 (0.0%)
gender [character] | 1. Female; 2. Male; 3. Other | 26705 (33.4%); 26698 (33.4%); 26597 (33.2%) | 0 (0.0%)
major [character] | 1. Arts; 2. Biology; 3. Business; 4. Computer Science; 5. Engineering; 6. Psychology | 13505 (16.9%); 13201 (16.5%); 13276 (16.6%); 13352 (16.7%); 13229 (16.5%); 13437 (16.8%) | 0 (0.0%)
study_hours_per_day [numeric] | Mean (sd): 4.2 (2); min ≤ med ≤ max: 0 ≤ 4.1 ≤ 12; IQR (CV): 2.7 (0.5) | 13364 distinct values | 0 (0.0%)
social_media_hours [numeric] | Mean (sd): 2.5 (1.4); min ≤ med ≤ max: 0 ≤ 2.5 ≤ 5; IQR (CV): 2.6 (0.6) | 51 distinct values | 0 (0.0%)
netflix_hours [numeric] | Mean (sd): 2 (1.2); min ≤ med ≤ max: 0 ≤ 2 ≤ 4; IQR (CV): 2 (0.6) | 41 distinct values | 0 (0.0%)
part_time_job [character] | 1. No; 2. Yes | 40195 (50.2%); 39805 (49.8%) | 0 (0.0%)
attendance_percentage [numeric] | Mean (sd): 70 (17.3); min ≤ med ≤ max: 40 ≤ 69.9 ≤ 100; IQR (CV): 29.9 (0.2) | 601 distinct values | 0 (0.0%)
sleep_hours [numeric] | Mean (sd): 7 (1.5); min ≤ med ≤ max: 4 ≤ 7 ≤ 12; IQR (CV): 2 (0.2) | 81 distinct values | 0 (0.0%)
diet_quality [character] | 1. Fair; 2. Good; 3. Poor | 26713 (33.4%); 39935 (49.9%); 13352 (16.7%) | 0 (0.0%)
exercise_frequency [numeric] | Mean (sd): 3.5 (2.3); min ≤ med ≤ max: 0 ≤ 4 ≤ 7; IQR (CV): 4 (0.7) | 0: 9908 (12.4%); 1: 9925 (12.4%); 2: 9928 (12.4%); 3: 9958 (12.4%); 4: 10043 (12.6%); 5: 9980 (12.5%); 6: 10206 (12.8%); 7: 10052 (12.6%) | 0 (0.0%)
parental_education_level [character] | 1. Bachelor; 2. High School; 3. Master; 4. PhD; 5. Some College | 15986 (20.0%); 16106 (20.1%); 15908 (19.9%); 15885 (19.9%); 16115 (20.1%) | 0 (0.0%)
internet_quality [character] | 1. High; 2. Low; 3. Medium | 26749 (33.4%); 26714 (33.4%); 26537 (33.2%) | 0 (0.0%)
mental_health_rating [numeric] | Mean (sd): 6.8 (1.9); min ≤ med ≤ max: 1 ≤ 6.9 ≤ 10; IQR (CV): 2.7 (0.3) | 91 distinct values | 0 (0.0%)
extracurricular_participation [character] | 1. No; 2. Yes | 40058 (50.1%); 39942 (49.9%) | 0 (0.0%)
previous_gpa [numeric] | Mean (sd): 3.6 (0.5); min ≤ med ≤ max: 1.6 ≤ 3.8 ≤ 4; IQR (CV): 0.7 (0.1) | 220 distinct values | 0 (0.0%)
semester [numeric] | Mean (sd): 4.5 (2.3); min ≤ med ≤ max: 1 ≤ 5 ≤ 8; IQR (CV): 5 (0.5) | 1: 10151 (12.7%); 2: 10000 (12.5%); 3: 9894 (12.4%); 4: 9868 (12.3%); 5: 10026 (12.5%); 6: 9978 (12.5%); 7: 10180 (12.7%); 8: 9903 (12.4%) | 0 (0.0%)
stress_level [numeric] | Mean (sd): 5 (2); min ≤ med ≤ max: 1 ≤ 5 ≤ 10; IQR (CV): 2.8 (0.4) | 91 distinct values | 0 (0.0%)
dropout_risk [character] | 1. No; 2. Yes | 78418 (98.0%); 1582 (2.0%) | 0 (0.0%)
social_activity [numeric] | Mean (sd): 2.5 (1.7); min ≤ med ≤ max: 0 ≤ 2 ≤ 5; IQR (CV): 3 (0.7) | 0: 13202 (16.5%); 1: 13373 (16.7%); 2: 13511 (16.9%); 3: 13356 (16.7%); 4: 13235 (16.5%); 5: 13323 (16.7%) | 0 (0.0%)
screen_time [numeric] | Mean (sd): 9.7 (2.8); min ≤ med ≤ max: 0.3 ≤ 9.7 ≤ 21; IQR (CV): 3.8 (0.3) | 198 distinct values | 0 (0.0%)
study_environment [character] | 1. Cafe; 2. Co-Learning Group; 3. Dorm; 4. Library; 5. Quiet Room | 16038 (20.0%); 15949 (19.9%); 16119 (20.1%); 15903 (19.9%); 15991 (20.0%) | 0 (0.0%)
access_to_tutoring [character] | 1. No; 2. Yes | 40039 (50.0%); 39961 (50.0%) | 0 (0.0%)
family_income_range [character] | 1. High; 2. Low; 3. Medium | 26653 (33.3%); 26720 (33.4%); 26627 (33.3%) | 0 (0.0%)
parental_support_level [numeric] | Mean (sd): 5.5 (2.9); min ≤ med ≤ max: 1 ≤ 5 ≤ 10; IQR (CV): 5 (0.5) | 1: 8153 (10.2%); 2: 8067 (10.1%); 3: 7945 (9.9%); 4: 7980 (10.0%); 5: 8045 (10.1%); 6: 8164 (10.2%); 7: 7869 (9.8%); 8: 7863 (9.8%); 9: 8023 (10.0%); 10: 7891 (9.9%) | 0 (0.0%)
motivation_level [numeric] | Mean (sd): 5.5 (2.9); min ≤ med ≤ max: 1 ≤ 5 ≤ 10; IQR (CV): 5 (0.5) | 1: 8022 (10.0%); 2: 8038 (10.0%); 3: 8075 (10.1%); 4: 7888 (9.9%); 5: 8013 (10.0%); 6: 8102 (10.1%); 7: 8064 (10.1%); 8: 7955 (9.9%); 9: 7988 (10.0%); 10: 7855 (9.8%) | 0 (0.0%)
exam_anxiety_score [numeric] | Mean (sd): 8.5 (1.8); min ≤ med ≤ max: 5 ≤ 10 ≤ 10; IQR (CV): 3 (0.2) | 5: 7855 (9.8%); 6: 7988 (10.0%); 7: 7955 (9.9%); 8: 8064 (10.1%); 9: 8102 (10.1%); 10: 40036 (50.0%) | 0 (0.0%)
learning_style [character] | 1. Auditory; 2. Kinesthetic; 3. Reading; 4. Visual | 19802 (24.8%); 20143 (25.2%); 19997 (25.0%); 20058 (25.1%) | 0 (0.0%)
time_management_score [numeric] | Mean (sd): 5.5 (2.6); min ≤ med ≤ max: 1 ≤ 5.5 ≤ 10; IQR (CV): 4.6 (0.5) | 91 distinct values | 0 (0.0%)
exam_score [numeric] | Mean (sd): 89.1 (11.6); min ≤ med ≤ max: 36 ≤ 93 ≤ 100; IQR (CV): 18 (0.1) | 65 distinct values | 0 (0.0%)

Generated by summarytools 1.1.4 (R version 4.5.1), 2025-08-27

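It’s important to remember that this is synthetic data. Whoever made it has done well to keep most of the categorical variables close to evenly distributed and, for the most part, the continuous variables roughly normal. One exception worth noting is exam_score itself: with a mean of 89.1, a median of 93, and a ceiling at 100, it is skewed toward the top of the scale.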

Forming a Hypothesis

All inferential statistical tests begin with a hypothesis. For an A/B test we hope or hypothesize that there will be some measurable difference in outcome between two groups.

So let me throw out an arbitrary hypothesis.

$$
H_0: \text{Women and men are equally smart (no difference in exam\_scores)} \\
H_a: \text{Women are smarter than men (will have higher exam\_scores)}
$$

The Mechanics of the Test

Assuming you know how to calculate an average and a standard deviation, this is the formula to get the t-statistic that will allow us to compare the means of our two groups.
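For reference, a standard way to write the two-sample t-statistic is given here, assuming the Welch (unequal-variance) form that R's `t.test` uses by default. The symbols are my own notation: $\bar{x}_A, \bar{x}_B$ are the group means, $s_A^2, s_B^2$ the sample variances (the squared standard deviations), and $n_A, n_B$ the group sizes.

```latex
t = \frac{\bar{x}_A - \bar{x}_B}
         {\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^{2}}
     {\dfrac{\left(s_A^2 / n_A\right)^{2}}{n_A - 1}
      + \dfrac{\left(s_B^2 / n_B\right)^{2}}{n_B - 1}}
```

Here $\nu$ is the Welch-Satterthwaite approximation to the degrees of freedom, which selects the t-distribution that $t$ is compared against.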

With the value of t in hand, we compare it against a critical value obtained from a t-distribution, but going into that is a little further than I want to go here. Just know that we can calculate this by hand given some resources, or we can use the statistical functions in R to quickly output the results.
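To make the "by hand" route concrete, here is a minimal sketch that computes the Welch t-statistic, its degrees of freedom, the critical value and the p-value directly, then checks them against `t.test`. The two groups are simulated stand-ins, not the exam data.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Hypothetical stand-ins for the two groups' scores
group_a <- rnorm(500, mean = 89, sd = 11)
group_b <- rnorm(500, mean = 90, sd = 12)

# Variance of each group's mean
va <- var(group_a) / length(group_a)
vb <- var(group_b) / length(group_b)

# Welch t-statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(group_a) - mean(group_b)) / sqrt(va + vb)
df <- (va + vb)^2 /
  (va^2 / (length(group_a) - 1) + vb^2 / (length(group_b) - 1))

# 5% two-sided critical value and two-sided p-value from the t-distribution
t_crit  <- qt(0.975, df)
p_value <- 2 * pt(-abs(t_stat), df)

# The built-in test should agree with the manual calculation
builtin <- t.test(group_a, group_b)
c(t_manual = t_stat, t_builtin = unname(builtin$statistic),
  p_manual = p_value, p_builtin = builtin$p.value, t_crit = t_crit)
```

We reject the null hypothesis at the 5% level when the absolute value of `t_stat` exceeds `t_crit`, which is the same as `p_value` falling below 0.05.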

library(dplyr)  # provides %>%, filter() and pull()

# seed for reproducibility
set.seed(6391)

# work has been done separately to ensure there are no confounding variables

# sampling 5000 rows of women
rando_women <- grades_and_features %>%
  filter(gender == 'Female') %>%
  .[sample(nrow(.), 5000), ] %>%
  pull(exam_score)
    
# sampling 5000 rows of men
rando_men <- grades_and_features %>%
  filter(gender == 'Male') %>%
  .[sample(nrow(.), 5000), ] %>%
  pull(exam_score)

t.test(rando_men, rando_women)
## 
## 	Welch Two Sample t-test
## 
## data:  rando_men and rando_women
## t = 0.25821, df = 9998, p-value = 0.7962
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3994445  0.5206445
## sample estimates:
## mean of x mean of y 
##   89.0694   89.0088

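The p-value tells us how likely it is that a difference in means at least as large as the one we observed could happen just by chance if there were no real difference in the population. A p-value of 0.7962 is far above the usual 0.05 threshold, so the observed difference is not statistically significant. Likewise, the two group means (89.07 and 89.01) are nearly identical, and the 95 percent confidence interval for their difference comfortably includes 0. Note that t.test() defaults to a two-sided alternative, so as run it tests for any difference rather than specifically for women scoring higher; with the men’s sample mean slightly above the women’s, a one-sided test in the direction of the alternative hypothesis would not be significant either. In this case we do not have enough evidence to reject the null hypothesis. That is not proof that women and men score the same, only that this data gives us no reason to believe they differ.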
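Since the alternative hypothesis stated earlier is directional (women scoring higher), a one-sided test is the closer match. The sketch below shows the call on simulated stand-in scores rather than the real data; `women` and `men` are hypothetical vectors drawn with the same mean by construction.

```r
set.seed(6391)  # arbitrary seed for reproducibility

# Hypothetical stand-ins for the sampled exam scores (equal means by design)
women <- rnorm(5000, mean = 89, sd = 11.6)
men   <- rnorm(5000, mean = 89, sd = 11.6)

# alternative = "greater" tests H_a: mean(women) - mean(men) > 0
one_sided <- t.test(women, men, alternative = "greater")
two_sided <- t.test(women, men)

c(one_sided = one_sided$p.value, two_sided = two_sided$p.value)
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one; when it points the other way, the one-sided p-value is above 0.5.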

As a Tool for Feature Selection

Now suppose that, beyond testing various groups for meaningful differences in exam scores, I wanted to go so far as to develop a formula to help predict exam score. I could build a linear regression model, or some derivative of that model type, but how would I go about choosing which features make the most sense? We’ve already seen there’s no meaningful difference in exam scores between men and women. I could just keep applying some inferential test to each of the categorical variables in my data, but that could take a long time and would not help me choose among the numeric variables.
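For what it’s worth, the brute-force screening described above can be scripted rather than done one variable at a time. The sketch below is one possible approach, not a recommendation: it runs a one-way ANOVA of the score on each categorical feature and applies a Holm correction, since running many tests inflates the chance of a false positive. The `toy` data frame is simulated, and only `tutoring` is built to affect the score.

```r
set.seed(99)  # arbitrary seed for reproducibility
n <- 600

# Hypothetical stand-in data: only `tutoring` is built to affect the score
toy <- data.frame(
  gender   = sample(c("Female", "Male", "Other"), n, replace = TRUE),
  job      = sample(c("No", "Yes"), n, replace = TRUE),
  tutoring = sample(c("No", "Yes"), n, replace = TRUE)
)
toy$exam_score <- 80 + 5 * (toy$tutoring == "Yes") + rnorm(n, sd = 8)

categorical <- c("gender", "job", "tutoring")

# One-way ANOVA of exam_score on each categorical feature in turn
raw_p <- sapply(categorical, function(feature) {
  fit <- aov(reformulate(feature, response = "exam_score"), data = toy)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# Holm correction for running several tests on the same data
screen <- data.frame(feature = categorical,
                     p_raw   = raw_p,
                     p_holm  = p.adjust(raw_p, method = "holm"))
screen[order(screen$p_holm), ]
```

This still says nothing about the numeric variables or about how the features interact.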

Advanced Feature Selection Methods

As a little addendum to this primer on inferential methods, I’ll talk a bit about feature selection too. Consider that I want to build a type of model that takes some input and outputs a prediction of exam score. How will I find out what are the most important input variab