{"id":1888,"date":"2022-04-11T10:56:26","date_gmt":"2022-04-11T03:56:26","guid":{"rendered":"https:\/\/mintea.blog\/?p=1888"},"modified":"2022-04-11T10:56:49","modified_gmt":"2022-04-11T03:56:49","slug":"1888","status":"publish","type":"post","link":"https:\/\/mintea.blog\/?p=1888","title":{"rendered":"Overfitting &#8211; Understanding overfitting"},"content":{"rendered":"<p><strong>Understanding overfitting: an inaccurate meme in Machine Learning<\/strong><\/p>\n<p><strong>Preamble<\/strong><br \/>\nThere is a lot of confusion among practitioners regarding the concept of\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Overfitting\">overfitting<\/a>. It seems that a kind of\u00a0<em>urban legend<\/em>, a\u00a0<em>meme<\/em>\u00a0or piece of\u00a0<em>folklore<\/em>, is circulating\u00a0in data science and allied fields in the form of the following statement:<\/p>\n<p><em>Applying\u00a0<\/em><a href=\"https:\/\/en.wikipedia.org\/wiki\/Cross-validation_(statistics)\"><em>cross-validation<\/em><\/a><em>\u00a0prevents overfitting and a good out-of-sample performance, low generalisation error in unseen data, indicates not an overfit.<\/em><\/p>\n<p>This statement is of course not true: cross-validation does not prevent your model from overfitting, and good out-of-sample performance does not guarantee that a model is not overfitted. What people actually refer to in one aspect of this statement is called\u00a0<em>overtraining<\/em>. Unfortunately, this meme is propagated not only in industry but in some academic papers as well. This might be at best a confusion of\u00a0<em>jargon<\/em>. 
But it is good practice to set the\u00a0<em>jargon<\/em>\u00a0right and be clear about what we refer to when we say\u00a0<em>overfitting<\/em>\u00a0in communicating our results.<\/p>\n<p><strong>Aim<\/strong><br \/>\nIn this post, we give an intuition for why\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Regression_validation\">model validation<\/a>, understood as approximating\u00a0the generalization error of a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Goodness_of_fit\">model fit<\/a>, and detection of overfitting cannot be resolved simultaneously on a single model. After some conceptual introduction, we work through a concrete example workflow for understanding\u00a0<em>overfitting<\/em>,\u00a0<em>overtraining<\/em>\u00a0and a typical final model building stage. We avoid Bayesian interpretations and regularisation, and restrict the post to regression and cross-validation, because regularisation has different ramifications due to its mathematical properties and prior distributions have different implications in Bayesian statistics. 
We assume an introductory background in machine learning, so this is not a beginner\u2019s tutorial.<\/p>\n<p>A recent question from\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Andrew_Gelman\">Andrew Gelman<\/a>, a Bayesian guru, regarding\u00a0<em>What is overfitting?<\/em>\u00a0was one of the reasons this post was developed, along with my frustration at seeing practitioners being muddy on the meaning of\u00a0<em>overfitting<\/em>, and at recently published data science articles, and even some academic papers, that continue to claim the above statement.<\/p>\n<p><strong>What do we need to satisfy in supervised learning?\u00a0<\/strong><br \/>\nOne of the most basic tasks in mathematics is to find a solution to a function. Let us restrict ourselves to real numbers in\u00a0<em>n<\/em>\u00a0dimensions, so that our domain of interest is\u00a0<strong>R<\/strong><em><sup>n<\/sup><\/em>. Now imagine a set of\u00a0<em>p<\/em>\u00a0points\u00a0<em>x<\/em><sub>i<\/sub>\u00a0living in this domain that form a dataset; this is actually\u00a0<em>a partial<\/em>\u00a0solution to a function. The main purpose of modelling is to find an explanation of the\u00a0dataset, meaning that we need to determine\u00a0<em>m<\/em>\u00a0parameters,\u00a0a\u2208<strong>R<\/strong><sup>m<\/sup>,\u00a0which are\u00a0unknown. (Note that a non-parametric model does not mean no parameters.) Mathematically speaking, this manifests as a function, as we said before,\u00a0f(x,a). This modelling is usually called\u00a0<em>regression<\/em>,\u00a0<em>interpolation<\/em>\u00a0or\u00a0<em>supervised learning\u00a0<\/em>depending on the literature you are reading. This is a form of\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Inverse_problem\">an inverse problem<\/a>, since we don\u2019t know the parameters but we have partial information regarding the variables. 
The main issue is that solutions are not\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Well-posed_problem\">well-posed<\/a>. Omitting axiomatic technical details, the practical problem is that we can find many functions\u00a0f(x,a), or models, explaining the\u00a0dataset.<\/p>\n<p>So, we seek the following two concepts to be satisfied by our model solution,\u00a0\u00a0f(x,a)=0.<\/p>\n<p>1. Generalized: A model should not depend on the dataset. This step is called\u00a0<em>model validation.<\/em><br \/>\n2. Minimally complex: A model should obey\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Occam%27s_razor\">Occam\u2019s razor or principle of parsimony<\/a>. This step is called\u00a0<em>model selection<\/em>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"320\" height=\"238\" class=\"wp-image-1889\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-3.png\" alt=\"Diagram\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-3.png 320w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-3-300x223.png 300w\" sizes=\"auto, (max-width: 320px) 100vw, 320px\" \/><\/p>\n<p>Figure 1: A workflow for model validation and selection in supervised learning.<\/p>\n<p>Generalization of a model can be measured by\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Goodness_of_fit\">goodness-of-fit<\/a>. It essentially tells us how well our model (chosen function) explains the dataset. Finding a\u00a0minimally\u00a0complex model<strong>\u00a0<\/strong>requires comparison against another model.<\/p>\n<p>Up to now, we have not named a technique for checking whether a\u00a0model is generalized and for selecting the best model. 
Unfortunately, there is no unique way of doing both, and that\u2019s the task of the data scientist or quantitative practitioner, which requires human judgement.<\/p>\n<p><strong>Model validation: An example\u00a0<\/strong><br \/>\nOne way to check if a model is generalized enough is to come up with a metric for how well it explains the dataset. Our task in model validation is to estimate the model error. For example,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Root-mean-square_deviation\">root mean square deviation<\/a>\u00a0(RMSD) is one metric we can use. If RMSD is low, we could say that our model fit is good; ideally it should be close to zero. But the model is not shown to be generalized enough if we use the same dataset to measure the goodness-of-fit. We could use a different dataset, specifically an out-of-sample dataset, to validate this as much as we can, i.e. the so-called hold-out method. Out-of-sample is just\u00a0a fancy way of saying we did not use the same dataset to find the values of the parameters\u00a0a. An improved way of doing this is cross-validation. We split our dataset into\u00a0k partitions, and we obtain\u00a0k RMSD values to average over. This is summarised in Figure 1. Note that a different parameterisation of the same model does not constitute a different model.<\/p>\n<p><strong>Model Selection: Detection of overfitting\u00a0<\/strong><br \/>\nOverfitting comes into play when we try to satisfy the \u2018minimally complex model\u2019 requirement. This is a comparison problem, and we need more than one model to judge whether a given model is an overfit.\u00a0Douglas Hawkins, in his classic paper\u00a0<a href=\"http:\/\/pubs.acs.org\/doi\/abs\/10.1021\/ci0342472\"><em>The Problem of Overfitting<\/em><\/a>, states that<\/p>\n<p><em>Overfitting of models is widely recognized as a concern. It is less recognized however that overfitting is not an absolute but involves a comparison. 
A model overfits if it is more complex than another model that fits equally well.<\/em><\/p>\n<p>The important point here is what we mean by a complex model, or how we can quantify model complexity. Unfortunately, again there is no unique way of doing this. One of the most used approaches is to say that a model with more parameters is more complex. But this is again a bit of a\u00a0<em>meme<\/em>\u00a0and not generally true. One could actually resort to different measures of complexity. For example, by this definition\u00a0f1(a,x)=ax\u00a0and\u00a0f2(a,x)=ax<sup>2<\/sup>\u00a0have the same complexity because they have the same number of free parameters, but intuitively\u00a0f2\u00a0is more complex, since it is nonlinear. There are many information-theory-based measures of complexity, but a discussion of those is beyond the scope of our post. For demonstration purposes, we will consider a model with more parameters and a higher degree of\u00a0nonlinearity to be more complex.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"320\" height=\"320\" class=\"wp-image-1890\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/chart-line-chart-description-automatically-gener-1.png\" alt=\"Chart, line chart\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/chart-line-chart-description-automatically-gener-1.png 320w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/chart-line-chart-description-automatically-gener-1-300x300.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/chart-line-chart-description-automatically-gener-1-150x150.png 150w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/chart-line-chart-description-automatically-gener-1-100x100.png 100w\" sizes=\"auto, (max-width: 320px) 100vw, 320px\" \/><\/p>\n<p>Figure 2: Simulated data and the non-stochastic part of the data.<\/p>\n<p><strong>Hands-on example<\/strong><br \/>\nWe have intuitively covered the reasons why we can\u2019t resolve 
model validation and judge overfitting simultaneously. We now try to demonstrate this with a simple dataset and simple models that nevertheless capture the above premise.<\/p>\n<p>A usual procedure is to generate a synthetic, or simulated, dataset from a model as a gold standard, and to use this dataset to build other models. Let\u2019s use the following functional form, from the\u00a0<a href=\"https:\/\/www.springer.com\/de\/book\/9780387310732?referer=www.springer.de\">classic text of Bishop<\/a>, but with added Gaussian noise<\/p>\n<p>f(x)=sin(2\u03c0x)+N(0,0.1).<\/p>\n<p>We generate a large enough set, 100 points, to avoid the sample size issue discussed in Bishop\u2019s book; see Figure 2. Let\u2019s decide on two models we would like to apply to this dataset in a supervised learning task. Note that we won\u2019t be discussing the Bayesian interpretation here, so the equivalence of these models under a strong prior assumption is not an issue, as we are using this example for ease of demonstrating the concept. 
Polynomial models of degree 3 and degree 6, which we call g(x) and h(x) respectively, are used to learn from the simulated data.<\/p>\n<p>g(x)=a<sub>0<\/sub>\u00a0+ a<sub>1<\/sub>x + a<sub>2<\/sub>x<sup>2<\/sup>\u00a0+ a<sub>3<\/sub>x<sup>3<\/sup><\/p>\n<p>and<\/p>\n<p>h(x)=b<sub>0<\/sub>\u00a0+ b<sub>1<\/sub>x + b<sub>2<\/sub>x<sup>2<\/sup>\u00a0+ b<sub>3<\/sub>x<sup>3<\/sup>\u00a0+ b<sub>4<\/sub>x<sup>4<\/sup>\u00a0+ b<sub>5<\/sub>x<sup>5<\/sup>\u00a0+ b<sub>6<\/sub>x<sup>6<\/sup><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"320\" height=\"320\" class=\"wp-image-1891\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-4.png\" alt=\"Diagram\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-4.png 320w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-4-300x300.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-4-150x150.png 150w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-4-100x100.png 100w\" sizes=\"auto, (max-width: 320px) 100vw, 320px\" \/><\/p>\n<p>Figure 3: Overtraining occurs after around 40 percent of the data usage for g(x).<\/p>\n<p><strong>Overtraining is not overfitting<\/strong><br \/>\n<em>Overtraining<\/em>\u00a0means that a model\u2019s performance degrades in learning model parameters against an objective variable that affects how the model is built; for example, an objective variable can be the training data size or the iteration cycle in a neural network. This is more prevalent in neural networks (see\u00a0<a href=\"http:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/1097-0142(20010415)91:8%2B%3C1615::AID-CNCR1175%3E3.0.CO;2-L\/full\">Dayhoff 2001<\/a>). 
In our practical example, this manifests in the hold-out method used to measure RMSD when modelling with g(x): in other words, in finding an optimal number of data points with which to train the model so that it performs better on unseen data. See Figures 3 and 4.<\/p>\n<p><strong>Overfitting with low validation error<\/strong><br \/>\nWe can also estimate the 10-fold cross-validation error, CV-RMSD. For this sampling, g and h have CV-RMSD values of 0.13 and 0.12 respectively. So, as we can see, the more complex model reaches similar predictive power under cross-validation, and we cannot detect this overfitting just by looking at the CV-RMSD value or by spotting the \u2018overtraining\u2019 curve in Figure 4. We need two models to compare, hence both Figures 3 and 4, with both CV-RMSD values. One might argue that in small datasets we may be able to tell the difference by looking at the gap between test and training errors; this is exactly how Bishop explains\u00a0<em>overfitting<\/em>, where he is in fact pointing out\u00a0<em>overtraining<\/em>\u00a0in small datasets.<\/p>\n<p><strong>Which trained model to deploy?<\/strong><br \/>\nWe have now found, empirically, the best performing model with minimal complexity. All well, but which trained model should we use in production?<br \/>\nActually, we have already built the model in model selection. 
In the above case, since we got similar predictive power from g and h, we will obviously use g, trained on the splitting sweet spot from Figure 3.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"320\" height=\"320\" class=\"wp-image-1892\" src=\"https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-5.png\" alt=\"Diagram\n\nDescription automatically generated\" srcset=\"https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-5.png 320w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-5-300x300.png 300w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-5-150x150.png 150w, https:\/\/mintea.blog\/wp-content\/uploads\/2022\/04\/diagram-description-automatically-generated-5-100x100.png 100w\" sizes=\"auto, (max-width: 320px) 100vw, 320px\" \/><\/p>\n<p>Figure 4: Overtraining occurs after around 30 percent of the data usage for h(x).<\/p>\n<p><strong>Conclusion<\/strong><br \/>\nThe essential message here is that good validation performance does not guarantee the detection of an\u00a0<em>overfitted<\/em>\u00a0model, as we have seen from examples using synthetic data in one dimension.\u00a0<em>Overtraining<\/em>\u00a0is actually what most practitioners mean when they use the term\u00a0<em>overfitting<\/em>.<\/p>\n<p><strong>Outlook<\/strong><br \/>\nAs more and more people use techniques from machine learning or inverse problems, both in academia and industry, some key technical concepts drift a bit and take on different definitions and meanings for different people, because people learn some concepts not from reading the literature carefully but verbally from their line managers or senior colleagues. This creates\u00a0<em>memes<\/em>\u00a0which are actually wrong, or which at least create a lot of confusion in jargon. 
It is very important that all of us, as practitioners,\u00a0<em>question<\/em>\u00a0all technical concepts and try to seek their origins in the published scientific literature, rather than relying entirely on verbal explanations from our experienced colleagues. Also, we should strongly avoid ridiculing questions from colleagues even if they sound too simple. At the end of the day we don\u2019t stop learning, and naive-looking questions might have very important consequences for the fundamentals of the field.<\/p>\n<p><strong>P.S.\u00a0<\/strong>As I mentioned above, the inspiration for writing this post was a recent post from Gelman (<a href=\"http:\/\/andrewgelman.com\/2017\/07\/15\/what-is-overfitting-exactly\/\">post<\/a>). He defined \u2018overfitting\u2019 as follows:<\/p>\n<p><em>Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model.<\/em><\/p>\n<p>Priors and\u00a0equivalence\u00a0of two models aside, Gelman\u2019s definition is weaker than Hawkins\u2019 definition, in that he accepts a complex model having similar predictive power. So, if we use Gelman\u2019s definition, it is OK to deploy either\u00a0<em>g<\/em>\u00a0or\u00a0<em>h<\/em>\u00a0in our toy example above. But strictly speaking, from Hawkins\u2019 perspective we need to deploy\u00a0<em>g<\/em>.<\/p>\n<p><a href=\"https:\/\/memosisland.blogspot.de\/2017\/08\/understanding-overfitting-inaccurate.html\">Original<\/a>. Reposted with permission.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Understanding overfitting: an inaccurate meme in Machine Learning Preamble There is a lot of confusion among practitioners regarding the concept of\u00a0overfitting. 
It seems like, a kind of\u00a0an\u00a0urban legend\u00a0or a\u00a0meme, a folklore\u00a0is circulating\u00a0in data science or allied fields with the following statement: Applying\u00a0cross-validation\u00a0prevents overfitting and a good out-of-sample performance, low generalisation error in unseen data, indicates &hellip; <a href=\"https:\/\/mintea.blog\/?p=1888\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Overfitting &#8211; Understanding overfitting<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":1890,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25],"tags":[32,26,66,58,52,51,59],"class_list":["post-1888","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bookmarked-articles","tag-analytic","tag-data","tag-data-fallacies","tag-data-modelling","tag-machine-learning","tag-modelling","tag-statistic"],"_links":{"self":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts\/1888","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1888"}],"version-history":[{"count":2,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts\/1888\/revisions"}],"predecessor-version":[{"id":1894,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/posts\/1888\/revisions\/1894"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=\/wp\/v2\/media\/1890"}],"wp:attachment":[{"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1888"}],"wp:term":[{"taxonomy"
:"category","embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1888"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mintea.blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1888"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}