{"id":1547,"date":"2021-12-26T17:06:50","date_gmt":"2021-12-26T10:06:50","guid":{"rendered":"https:\/\/mintea.blog\/?p=1547"},"modified":"2021-12-26T17:07:05","modified_gmt":"2021-12-26T10:07:05","slug":"1547","status":"publish","type":"post","link":"https:\/\/mintea.blog\/?p=1547","title":{"rendered":"A\/B Testing: A Complete Guide to Statistical Testing"},"content":{"rendered":"<p><strong>A\/B Testing: A Complete Guide to Statistical Testing<\/strong><\/p>\n<p>For marketers and data scientists alike, it\u2019s crucial to set up the right test.<\/p>\n<p>What is A\/B testing?<\/p>\n<p><strong>A\/B testing<\/strong>\u00a0is one of the most popular controlled experiments used to optimize web marketing strategies. It allows decision makers to choose the best design for a website by looking at the analytics results obtained with two possible alternatives A and B.<\/p>\n<p>In this article we\u2019ll see how different statistical methods can be used to make A\/B testing successful. I recommend you to also have a look at\u00a0<a href=\"https:\/\/github.com\/FrancescoCasalegno\/AB_Testing\/blob\/main\/AB_Testing.ipynb\" target=\"_blank\" rel=\"noopener\"><strong>this notebook\u00a0<\/strong><\/a>where you can play with the examples discussed in this article.<\/p>\n<p>To understand what\u00a0A\/B testing is about, let\u2019s consider two alternative designs: A and B. Visitors of a website are randomly served with one of the two. Then, data about their activity is collected by web analytics. Given this data, one can apply statistical tests to determine whether one of the two designs has better efficacy.<\/p>\n<p>Now, different kinds of metrics can be used to measure a website efficacy. With\u00a0<strong>discrete metrics<\/strong>, also called\u00a0<strong>binomial metrics<\/strong>, only the two values\u00a0<strong>0<\/strong>\u00a0and\u00a0<strong>1<\/strong>\u00a0are possible. 
The following are examples of popular discrete metrics.<\/p>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Click-through_rate\" target=\"_blank\" rel=\"noopener\">Click-through rate<\/a>\u00a0\u2014 if a user is shown an advertisement, do they click on it?<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Conversion_rate_optimization\" target=\"_blank\" rel=\"noopener\">Conversion rate<\/a>\u00a0\u2014 if a user is shown an advertisement, do they convert into customers?<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Bounce_rate\" target=\"_blank\" rel=\"noopener\">Bounce rate<\/a>\u00a0\u2014 if a user visits a website, is the next visited page on the same website?<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"209\" class=\"wp-image-1548\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-application-description.png\" alt=\"Graphical user interface, application\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-application-description.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-application-description-300x90.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Discrete metrics: click-through rate (image by author)<\/p>\n<p>With\u00a0<strong>continuous metrics<\/strong>, also called\u00a0<strong>non-binomial metrics<\/strong>, the metric may take continuous values that are not limited to a set of two discrete states. 
The following are examples of popular continuous metrics.<\/p>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Average_revenue_per_user\" target=\"_blank\" rel=\"noopener\">Average revenue per user<\/a>\u00a0\u2014 how much revenue does a user generate in a month?<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Session_(web_analytics)\" target=\"_blank\" rel=\"noopener\">Average session duration<\/a>\u00a0\u2014 for how long does a user stay on a website in a session?<\/li>\n<li><a href=\"https:\/\/www.optimizely.com\/optimization-glossary\/average-order-value\/\" target=\"_blank\" rel=\"noopener\">Average order value<\/a>\u00a0\u2014 what is the total value of the order of a user?<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"209\" class=\"wp-image-1549\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-application-description-1.png\" alt=\"Graphical user interface, application\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-application-description-1.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-application-description-1-300x90.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Continuous metrics: average order value (image by author)<\/p>\n<p>We are going to see in detail how discrete and continuous metrics require different statistical tests. But first, let\u2019s quickly review some fundamental concepts of statistics.<\/p>\n<p>Statistical significance<\/p>\n<p>With the data we collected from the activity of users of our website, we can compare the efficacy of the two designs A and B. 
Simply comparing mean values wouldn\u2019t be very meaningful, as we would fail to assess the\u00a0<strong>statistical significance<\/strong>\u00a0of our observations.\u00a0It is indeed fundamental to determine how likely it is that the observed discrepancy between the two samples originates from chance.<\/p>\n<p>In order to do that, we will use a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Two-sample_hypothesis_testing\" target=\"_blank\" rel=\"noopener\">two-sample hypothesis test<\/a>. Our\u00a0<strong>null hypothesis H0<\/strong>\u00a0is that the two designs A and B have the same efficacy, i.e. that they produce an equivalent click-through rate, or average revenue per user, etc. The statistical significance is then measured by the\u00a0<strong>p-value<\/strong>, i.e. the probability, assuming that H0 is true, of observing a discrepancy between our samples at least as strong as the one that we actually observed.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"65\" class=\"wp-image-1550\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/word-image-48.png\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/word-image-48.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/word-image-48-300x28.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>P-value (image by author)<\/p>\n<p>Now, some care has to be applied to properly choose the\u00a0<strong>alternative hypothesis Ha<\/strong>. This choice corresponds to the choice between\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/One-_and_two-tailed_tests\" target=\"_blank\" rel=\"noopener\">one- and two-tailed tests\u00a0<\/a>.<\/p>\n<p>A\u00a0<strong>two-tailed test<\/strong>\u00a0is preferable in our case, since we have no reason to know a priori whether the discrepancy between the results of A and B will be in favor of A or B. 
This means that we take as the alternative hypothesis\u00a0<strong>Ha<\/strong>\u00a0the hypothesis that A and B have different efficacy.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"236\" class=\"wp-image-1551\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-description-automatically-generated-2.png\" alt=\"Chart\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-description-automatically-generated-2.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-description-automatically-generated-2-300x101.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>One- and Two-tailed tests (image by author)<\/p>\n<p>The\u00a0<strong>p-value<\/strong>\u00a0is therefore computed as the area under the two tails of the probability density function\u00a0<strong>p(x)<\/strong>\u00a0of a chosen test statistic on all\u00a0<strong>x\u2019<\/strong>\u00a0s.t.\u00a0<strong>p(x\u2019) &lt;= p(our observation)<\/strong>. The computation of this p-value clearly depends on the data distribution. So we will first see how to compute it for discrete metrics, and then for continuous metrics.<\/p>\n<p>Discrete metrics<\/p>\n<p>Let\u2019s first consider a discrete metric such as the click-through rate. 
We randomly show visitors one of two possible designs of an advertisement, and we keep track of how many of them click on it.<\/p>\n<p>Let\u2019s say that we collected the following information.<\/p>\n<ul>\n<li><strong>nX = 15<\/strong>\u00a0visitors saw the advertisement A, and\u00a0<strong>7<\/strong>\u00a0of them clicked on it.<\/li>\n<li><strong>nY = 19<\/strong>\u00a0visitors saw the advertisement B, and\u00a0<strong>15<\/strong>\u00a0of them clicked on it.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"159\" class=\"wp-image-1552\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/table-description-automatically-generated-11.png\" alt=\"Table\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/table-description-automatically-generated-11.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/table-description-automatically-generated-11-300x68.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Click-through ratios: contingency table (image by author)<\/p>\n<p>At first glance, it looks like version B was more effective, but how statistically significant is this discrepancy?<\/p>\n<p>Fisher\u2019s exact test<\/p>\n<p>Using the 2&#215;2\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Contingency_table\" target=\"_blank\" rel=\"noopener\">contingency table shown above\u00a0<\/a>we can use\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Fisher%27s_exact_test\" target=\"_blank\" rel=\"noopener\">Fisher\u2019s exact test<\/a>\u00a0to compute an exact p-value and test our hypothesis. To understand how this test works, let us start by noticing that if we fix the margins of the table (i.e. 
the four sums of each row and column), then only few different outcomes are possible.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"338\" class=\"wp-image-1553\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-application-description-2.png\" alt=\"Graphical user interface, application\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-application-description-2.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-application-description-2-300x145.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Click-through ratios: possible outcomes (image by author)<\/p>\n<p>Now, the key observation is that, under the null hypothesis H0 that A and B have same efficacy, the probability of observing any of these possible outcomes is given by the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Hypergeometric_distribution\" target=\"_blank\" rel=\"noopener\">hypergeometric distribution\u00a0<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"98\" class=\"wp-image-1554\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-description-automaticall-1.png\" alt=\"Graphical user interface\n\nDescription automatically generated with low confidence\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-description-automaticall-1.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/graphical-user-interface-description-automaticall-1-300x42.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Hypergeometric distribution of possible outcomes (image by author)<\/p>\n<p>Using this formula we obtain that:<\/p>\n<ul>\n<li>the probability of seeing our actual observations is\u00a0<strong>~4.5%<\/strong><\/li>\n<li>the probability of seeing even more unlikely observations in 
favor of B is\u00a0<strong>~1.0%<\/strong>\u00a0(left tail);<\/li>\n<li>the probability of seeing even more unlikely observations in favor of A is\u00a0<strong>~2.0%<\/strong>\u00a0(right tail).<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" class=\"wp-image-1555\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera.png\" alt=\"Chart, histogram\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-300x185.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-180x110.png 180w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Click-through ratios: tails and p-value (image by author)<\/p>\n<p>So Fisher\u2019s exact test gives\u00a0<strong>p-value \u2248 7.5%<\/strong>.<\/p>\n<p>Pearson\u2019s chi-squared test<\/p>\n<p>Fisher\u2019s exact test has the important advantage of computing exact p-values. But if we have a large sample size, it may be computationally inefficient. In this case, we can use\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Pearson%27s_chi-square_test\" target=\"_blank\" rel=\"noopener\">Pearson\u2019s chi-squared test<\/a>\u00a0to compute an approximate p-value.<\/p>\n<p>Let us call\u00a0<strong>Oij<\/strong>\u00a0the observed value of the contingency table at row\u00a0<strong>i<\/strong>\u00a0and column\u00a0<strong>j<\/strong>. Under the null hypothesis of independence of rows and columns, i.e. assuming that A and B have the same efficacy, we can easily compute the corresponding expected values\u00a0<strong>Eij<\/strong>. 
Moreover, if the observations are normally distributed, then the \u03c72 statistic follows exactly a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Chi-square_distribution\" target=\"_blank\" rel=\"noopener\">chi-square distribution\u00a0<\/a>with 1 degree of freedom.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"98\" class=\"wp-image-1556\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-text-description-automatical.png\" alt=\"A picture containing text\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-text-description-automatical.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-text-description-automatical-300x42.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Pearson\u2019s chi-squared test (image by author)<\/p>\n<p>In fact, this test can also be used with non-normal observations if the sample size is large enough, thanks to the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Central_limit_theorem\" target=\"_blank\" rel=\"noopener\">central limit theorem<\/a>.<\/p>\n<p>In our example, using Pearson\u2019s chi-square test we obtain\u00a0<strong>\u03c72 \u2248 3.825<\/strong>, which gives\u00a0<strong>p-value \u2248 5.1%<\/strong>.<\/p>\n<p>Continuous metrics<\/p>\n<p>Let\u2019s now consider the case of a continuous metric such as the average revenue per user. 
We randomly show visitors one of two possible layouts of our website, and based on how much revenue each user generates in a month we want to determine if one of the two layouts is more effective.<\/p>\n<p>Let\u2019s consider the following case.<\/p>\n<ul>\n<li><strong>nX = 17<\/strong>\u00a0users saw the layout A, and then made the following purchases: 200$, 150$, 250$, 350$, 150$, 150$, 350$, 250$, 150$, 250$, 150$, 150$, 200$, 0$, 0$, 100$, 50$.<\/li>\n<li><strong>nY = 14<\/strong>\u00a0users saw the layout B, and then made the following purchases: 300$, 150$, 150$, 400$, 250$, 250$, 150$, 200$, 250$, 150$, 300$, 200$, 250$, 200$.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" class=\"wp-image-1557\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-bar-chart-description-automatically-genera.png\" alt=\"Chart, bar chart\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-bar-chart-description-automatically-genera.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-bar-chart-description-automatically-genera-300x185.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-bar-chart-description-automatically-genera-180x110.png 180w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Average revenue per user: samples distribution (image by author)<\/p>\n<p>Again, at first glance, it looks like version B was more effective. 
But how statistically significant is this discrepancy?<\/p>\n<p>Z-test<\/p>\n<p>The\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Z-test\" target=\"_blank\" rel=\"noopener\">Z-test<\/a>\u00a0can be applied under the following assumptions.<\/p>\n<ul>\n<li>The observations are normally distributed (or the sample size is large).<\/li>\n<li>The sampling distributions have known standard deviations\u00a0<strong>\u03c3X<\/strong>\u00a0and\u00a0<strong>\u03c3Y<\/strong>.<\/li>\n<\/ul>\n<p>Under the above assumptions, the Z-test exploits the fact that the following\u00a0<strong>Z statistic<\/strong>\u00a0has a standard normal distribution.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"98\" class=\"wp-image-1558\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/diagram-description-automatically-generated-13.png\" alt=\"Diagram\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/diagram-description-automatically-generated-13.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/diagram-description-automatically-generated-13-300x42.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Z-test (image by author)<\/p>\n<p>Unfortunately, in most real applications the standard deviations are unknown and must be estimated, so a t-test is preferable, as we will see later. 
Anyway, if in our case we knew the true values\u00a0<strong>\u03c3X=100<\/strong>\u00a0and\u00a0<strong>\u03c3Y=90<\/strong>, then we would obtain\u00a0<strong>z \u2248 -1.697<\/strong>, which corresponds to a\u00a0<strong>p-value \u2248 9%<\/strong>.<\/p>\n<p>Student\u2019s t-test<\/p>\n<p>In most cases, the variances of the sampling distributions are unknown, so that we need to estimate them.\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Student%27s_t-test\" target=\"_blank\" rel=\"noopener\">Student\u2019s t-test<\/a>\u00a0can then be applied under the following assumptions.<\/p>\n<ul>\n<li>The observations are normally distributed (or the sample size is large).<\/li>\n<li>The sampling distributions have \u201csimilar\u201d variances\u00a0<strong>\u03c3X \u2248 \u03c3Y<\/strong>.<\/li>\n<\/ul>\n<p>Under the above assumptions, Student\u2019s t-test relies on the observation that the following\u00a0<strong>t statistic<\/strong>\u00a0has a Student\u2019s t distribution.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"98\" class=\"wp-image-1559\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-letter-description-automatically-generated.png\" alt=\"Text, letter\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-letter-description-automatically-generated.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-letter-description-automatically-generated-300x42.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Student\u2019s t-test<\/p>\n<p>Here\u00a0<strong>SP<\/strong>\u00a0is the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Pooled_variance\" target=\"_blank\" rel=\"noopener\">pooled standard deviation<\/a>\u00a0obtained from the sample variances\u00a0<strong>SX<\/strong>\u00a0and\u00a0<strong>SY<\/strong>, which are computed using the unbiased formula that applies\u00a0<a 
href=\"https:\/\/en.wikipedia.org\/wiki\/Bessel%27s_correction\" target=\"_blank\" rel=\"noopener\">Bessel\u2019s correction<\/a>.<\/p>\n<p>In our example, using Student\u2019s t-test we obtain\u00a0<strong>t \u2248 -1.789<\/strong>\u00a0and\u00a0<strong>\u03bd = 29<\/strong>, which give\u00a0<strong>p-value \u2248 8.4%<\/strong>.<\/p>\n<p>Welch\u2019s t-test<\/p>\n<p>In most cases Student\u2019s t test can be effectively applied with good results. However, it may occasionally happen that its second assumption (similar variance of the sampling distributions) is violated. In that case, we cannot compute a pooled variance and rather than Student\u2019s t test we should use\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Welch%27s_t-test\" target=\"_blank\" rel=\"noopener\">Welch\u2019s t-test<\/a>.<\/p>\n<p>This test operates under the same assumptions as Student\u2019s t-test but removes the requirement on the similar variances. Then, we can use a slightly different\u00a0<strong>t statistic<\/strong>, which also has a Student\u2019s t distribution, but with a different number of degrees of freedom\u00a0<strong>\u03bd<\/strong>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"97\" class=\"wp-image-1560\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-1.png\" alt=\"Text\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-1.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/text-description-automatically-generated-1-300x42.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Welch\u2019s t-test<\/p>\n<p>The complex formula for\u00a0<strong>\u03bd<\/strong>\u00a0comes from the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Welch%E2%80%93Satterthwaite_equation\" target=\"_blank\" rel=\"noopener\">Welch\u2013Satterthwaite equation\u00a0<\/a>.<\/p>\n<p>In our example, 
using Welch\u2019s t-test we obtain\u00a0<strong>t \u2248 -1.848<\/strong>\u00a0and\u00a0<strong>\u03bd \u2248 28.51<\/strong>, which give\u00a0<strong>p-value \u2248 7.5%<\/strong>.<\/p>\n<p>Continuous non-normal metrics<\/p>\n<p>In the previous section on continuous metrics, we assumed that our observations came from normal distributions. But non-normal distributions are extremely common when dealing with per-user monthly revenues etc. There are several ways in which normality is often violated:<\/p>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Zero-inflated_model\" target=\"_blank\" rel=\"noopener\">zero-inflated distributions\u00a0<\/a>\u2014 most users don\u2019t buy anything at all, so lots of zero observations;<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Multimodal_distribution\" target=\"_blank\" rel=\"noopener\">multimodal distributions<\/a>\u00a0\u2014 a market segment tends to purchase cheap products, while another segment purchases more expensive products.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"398\" class=\"wp-image-1561\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-histogram-description-automa.png\" alt=\"A picture containing histogram\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-histogram-description-automa.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-histogram-description-automa-300x171.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Continuous non-normal distribution (image by author)<\/p>\n<p>However, if we have enough samples, tests derived under normality assumptions like Z-test, Student\u2019s t-test, and Welch\u2019s t-test can still be applied for observations that significantly deviate from normality. 
Indeed, thanks to the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Central_limit_theorem\" target=\"_blank\" rel=\"noopener\">central limit theorem<\/a>, the distribution of the test statistics tends to normality as the sample size increases. In the zero-inflated and multimodal example we are considering, even a sample size of 40 produces a distribution that is well approximated by a normal distribution.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"435\" class=\"wp-image-1562\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-1.png\" alt=\"Chart, histogram\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-1.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/chart-histogram-description-automatically-genera-1-300x186.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Convergence to normality of a non-normal distribution (image by author)<\/p>\n<p>But if the sample size is still too small to assume normality, we have no other choice than using a non-parametric approach such as the Mann-Whitney U test.<\/p>\n<p>Mann\u2013Whitney U test<\/p>\n<p>This test makes no assumption on the nature of the sampling distributions, so it is fully nonparametric. 
The idea of the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Mann%E2%80%93Whitney_U_test\" target=\"_blank\" rel=\"noopener\">Mann-Whitney U test<\/a>\u00a0is to compute the following\u00a0<strong>U statistic<\/strong>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"98\" class=\"wp-image-1563\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-graphical-user-interface-des-1.png\" alt=\"A picture containing graphical user interface\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-graphical-user-interface-des-1.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/a-picture-containing-graphical-user-interface-des-1-300x42.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Mann-Whitney U test (image by author)<\/p>\n<p>The values of this test statistic are tabulated, as the distribution can be computed under the null hypothesis that, for random samples\u00a0<strong>X<\/strong>\u00a0and\u00a0<strong>Y<\/strong>\u00a0from the two populations, the probability\u00a0<strong>P(X &lt; Y)<\/strong>\u00a0is the same as\u00a0<strong>P(X &gt; Y)<\/strong>.<\/p>\n<p>In our example, using the Mann-Whitney U test we obtain\u00a0<strong>u = 76<\/strong>\u00a0which gives\u00a0<strong>p-value \u2248 8.0%<\/strong>.<\/p>\n<p>Conclusion<\/p>\n<p>In this article we have seen that different kinds of metrics, sample sizes, and sampling distributions require different kinds of statistical tests for computing the significance of A\/B tests. 
We can summarize all these possibilities in the form of a decision tree.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"322\" class=\"wp-image-1564\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/diagram-description-automatically-generated-14.png\" alt=\"Diagram\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/diagram-description-automatically-generated-14.png 700w, https:\/\/mintea.blog\/wp-content\/uploads\/2021\/12\/diagram-description-automatically-generated-14-300x138.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Summary of the statistical tests to be used for A\/B testing (image by author)<\/p>\n<p>If you want to know more, you can start by playing with\u00a0<a href=\"https:\/\/github.com\/FrancescoCasalegno\/AB_Testing\/blob\/main\/AB_Testing.ipynb\" target=\"_blank\" rel=\"noopener\"><strong>this notebook<\/strong>\u00a0<\/a>where you can see all the examples discussed in this article!<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A\/B Testing: A Complete Guide to Statistical Testing For marketers and data scientists alike, it\u2019s crucial to set up the right test. What is A\/B testing? A\/B testing\u00a0is one of the most popular controlled experiments used to optimize web marketing strategies. 
It allows decision makers to choose the best design for a website by looking &hellip; <a href=\"https:\/\/mintea.blog\/?p=1547\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">A\/B Testing: A Complete Guide to Statistical Testing<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":1549,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25],"tags":[32,26,58,52,51,59],"class_list":["post-1547","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bookmarked-articles","tag-analytic","tag-data","tag-data-modelling","tag-machine-learning","tag-modelling","tag-statistic"],"_links":{"self":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts\/1547","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1547"}],"version-history":[{"count":2,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts\/1547\/revisions"}],"predecessor-version":[{"id":1566,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts\/1547\/revisions\/1566"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/media\/1549"}],"wp:attachment":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1547"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1547"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1547"}],"curies":[{"na
me":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}