K-Fold Cross Validation & Bootstrapping

K-fold cross validation is a widely used approach for estimating test error. The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K-1 parts combined, and then obtain predictions for the left-out kth part. This is done in turn for each part k = 1, 2, ..., K, and then the results are combined. Since each training set is only (K-1)/K as large as the original training set, the estimates of prediction error will typically be biased upward. This bias is minimized when K = n (leave-one-out cross validation), but that estimate has high variance.
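Here is a minimal sketch of K-fold cross validation using scikit-learn's KFold splitter. The linear regression model and the synthetic dataset are purely illustrative assumptions, not part of the discussion above:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical regression data, used only to make the example runnable
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    # Fit on the K-1 combined parts, predict on the left-out kth part
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_errors.append(mean_squared_error(y[test_idx], preds))

# Combine the K fold errors into a single estimate of test error
cv_estimate = np.mean(fold_errors)
print(f"{K}-fold CV estimate of test MSE: {cv_estimate:.3f}")
```

Each fold trains on (K-1)/K of the data, which is why the resulting error estimate tends to be slightly pessimistic relative to a model trained on the full dataset.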

Bootstrapping is a resampling method used in statistics to estimate the sampling distribution of a statistic by repeatedly resampling from the original data with replacement. It is commonly used to estimate confidence intervals or to assess the variability of a statistic when the population distribution is unknown or difficult to model. As a running example, suppose we have a dataset of exam scores for a class of 50 students.

Bootstrapping proceeds as follows (a code sketch of these steps appears after the list):

1. Create multiple bootstrap samples by randomly selecting data points from the original dataset with replacement. Each bootstrap sample has the same size as the original dataset (50 in this case).
2. Compute the statistic of interest (e.g., mean, median, standard deviation) for each bootstrap sample. For this example, we calculate the mean of each bootstrap sample.
3. Repeat the resampling and calculation a large number of times (typically thousands) to generate a distribution of the statistic. This distribution represents the variability of the statistic under different random samples.
4. Use the resulting bootstrap distribution to estimate confidence intervals or assess the variability of the statistic. For instance, you might calculate the 95% confidence interval for the mean exam score based on the bootstrap distribution.
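Below is a minimal sketch of this procedure in Python. The exam scores are simulated here as an assumption so the example runs end to end; the confidence interval is taken from the percentiles of the bootstrap distribution:

```python
import numpy as np

# Hypothetical exam scores for a class of 50 students (simulated for illustration)
rng = np.random.default_rng(42)
scores = rng.normal(loc=72, scale=10, size=50)

B = 10_000  # number of bootstrap samples
boot_means = np.empty(B)
for b in range(B):
    # Resample 50 scores with replacement and record the mean
    sample = rng.choice(scores, size=scores.size, replace=True)
    boot_means[b] = sample.mean()

# 95% percentile confidence interval for the mean exam score
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean of bootstrap means: {boot_means.mean():.2f}")
print(f"95% CI for the mean score: ({lo:.2f}, {hi:.2f})")
```

The spread of the 10,000 bootstrap means directly quantifies how much the sample mean would vary across different random samples of 50 students.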
