K-Fold Cross Validation & Bootstrapping

K-fold cross-validation is a widely used approach for estimating test error. The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K−1 parts (combined), and then obtain predictions for the left-out kth part. This is done in turn for each part k = 1, 2, ..., K, and the results are combined. Since each training set is only (K−1)/K as big as the original training set, the estimates of prediction error will typically be biased upward. This bias is minimized when K = n (leave-one-out cross-validation), but that estimate has high variance.
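
A minimal sketch of this procedure in Python, assuming a NumPy feature matrix X and response vector y are already in hand; the choice of K = 5 and a plain linear model are illustrative, not part of the method itself:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def kfold_cv_mse(X, y, K=5, seed=0):
    kf = KFold(n_splits=K, shuffle=True, random_state=seed)
    fold_errors = []
    for train_idx, test_idx in kf.split(X):
        # Fit on the K-1 combined parts, predict on the held-out kth part.
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        fold_errors.append(mean_squared_error(y[test_idx], preds))
    # The CV estimate of test error is the average over the K folds.
    return np.mean(fold_errors)
```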

Bootstrapping is a resampling method used in statistics to estimate the sampling distribution of a statistic by repeatedly resampling from the original data with replacement. It’s commonly used for estimating confidence intervals or assessing the variability of a statistic when the population distribution is unknown or difficult to model. Let’s say we have a dataset consisting of exam scores for a class of 50 students.

Bootstrapping involves creating multiple bootstrap samples by randomly selecting data points from the original dataset with replacement; each bootstrap sample has the same size as the original dataset (50 in this case). We then compute the statistic of interest (e.g., mean, median, standard deviation) for each bootstrap sample; for this example, we calculate the mean of each sample. Resampling the data and calculating the statistic is repeated a large number of times (typically thousands) to generate a distribution of the statistic. This distribution represents the variability of the statistic under different random samples, and we can use it to estimate confidence intervals or assess the variability of the statistic. For instance, we might calculate the 95% confidence interval for the mean exam score from the bootstrap distribution.
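
A short bootstrap sketch for the exam-score example; the scores array below is simulated placeholder data standing in for the class of 50 students, and B = 10,000 resamples is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=75, scale=10, size=50)   # placeholder exam scores

B = 10_000                                       # number of bootstrap samples
boot_means = np.empty(B)
for b in range(B):
    # Resample 50 scores with replacement and record the sample mean.
    sample = rng.choice(scores, size=scores.size, replace=True)
    boot_means[b] = sample.mean()

# 95% percentile confidence interval from the bootstrap distribution.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap mean: {boot_means.mean():.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```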

RESAMPLING METHODS

These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model. For example, they provide estimates of test-set prediction error and of the standard deviation and bias of our parameter estimates.

Distinction between the Test Error and Training Error:

Test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method. In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training. But the training error rate is often quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

Bias-Variance Trade-off:

Bias and variance together determine the prediction (test) error: the expected test error at a point x0 decomposes as E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε). There is a trade-off because increasing model flexibility typically reduces bias but increases variance; we choose the level of flexibility that minimizes their sum, and that sum (plus the irreducible error) is the test error.

Validation-Set Approach:

Here we randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.

The model is fit on the training set, and the fitted model is used to predict the response for the observations in the validation set. The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative (continuous) response and the misclassification rate in the case of a qualitative (discrete) response.
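
A minimal sketch of the validation-set approach, again assuming a feature matrix X and response y; the 50/50 split, random seed, and linear model are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Single random split into a training set and a hold-out (validation) set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))  # estimate of test MSE
```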

Drawbacks of validation approach:

The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set. In addition, only a subset of the observations, those in the training set rather than the validation set, are used to fit the model. This suggests that the validation-set error may tend to overestimate the test error for the model fit on the entire data set. Why? In general, the more data one has, the lower the error.

Relation between Pre-Molt and Post-Molt

We examine the relationship between the pre-molt and post-molt sizes of crabs using statistical analysis. When we compare the histograms of crabs' sizes pre-molt and post-molt side by side, we observe that the shapes of the distributions are quite similar. The only notable distinction is a mean difference of 143.898 − 129.212 = 14.6858. The question is whether this difference in means is statistically significant. To tackle this, we can employ a common statistical method known as a t-test. The estimated p-value is p = 0.0341998. With a p-value < 0.05, we reject the null hypothesis that there is no real difference.
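
A sketch of this t-test, assuming post_molt and pre_molt are NumPy arrays holding the 472 post-molt and 472 pre-molt sizes (the actual data are not reproduced here):

```python
from scipy import stats

# Two-sample t-test for a difference in mean size between post-molt and pre-molt crabs.
t_stat, p_value = stats.ttest_ind(post_molt, pre_molt)
print(f"observed difference in means: {post_molt.mean() - pre_molt.mean():.4f}")
print(f"t = {t_stat:.3f}, p = {p_value:.6f}")
```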

The primary use of a t-test is to assess whether there is a significant difference in the means of two populations. Furthermore, applying a t-test to compare two means using suitable software may not inherently provide a clear understanding of how the p-value was computed. For these reasons we carry out a Monte Carlo procedure to calculate a p-value for the observed difference in means under a null hypothesis that assumes no real difference. We have 472 post-molt data points and another 472 pre-molt data points. If we combine these two sets into one, resulting in a combined dataset of 944 points, and then randomly divide it into two separate buckets, namely Bucket A with 472 data points and Bucket B containing what remains, we can calculate the difference in means between these buckets. This process is repeated N times, and we keep a record of how many times, n, the difference in means is greater than or equal to 14.6858. The probability, denoted as P, is then calculated as P = n/N.
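
A sketch of this Monte Carlo (permutation) procedure under the same assumptions about post_molt and pre_molt; N and the random seed are arbitrary, and the count mirrors the one-sided criterion described above:

```python
import numpy as np

rng = np.random.default_rng(0)
observed_diff = post_molt.mean() - pre_molt.mean()   # ~ 14.6858
combined = np.concatenate([post_molt, pre_molt])     # pooled dataset of 944 points

N = 100_000
count = 0
for _ in range(N):
    # Randomly split the pooled data into Bucket A (472 points) and Bucket B (the rest).
    shuffled = rng.permutation(combined)
    bucket_a, bucket_b = shuffled[:472], shuffled[472:]
    # Count how often the random difference in means reaches the observed one.
    if bucket_a.mean() - bucket_b.mean() >= observed_diff:
        count += 1

p_value = count / N   # P = n / N
```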

Linear Regression Model With More Than One Predictor Variable

We initially have a response variable Y, a single predictor X1, and a simple linear regression mean function:

Y = β0 + β1X1 + ϵ

Now, let’s introduce a second variable X2, and aim to understand how Y depends on both X1 and X2 simultaneously. By incorporating X2 into the analysis, we create a mean function that considers the values of both X1 and X2:

Y = β0 + β1X1 + β2X2 + ϵ

The primary objective in including X2 is to account for the portion of Y that hasn’t already been explained by X1.

% Diabetes (Predicted) ← % Inactivity, % Obesity (Predictors or Factors)

The Generalized Linear Model extends the concept of linear regression by introducing a link function that relates the linear predictor to the response variable, and by permitting the variance of each measurement to depend on its predicted value.
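
As one illustration, a logistic GLM sketched with statsmodels, where the logit link connects the linear predictor to a binary response and the binomial family lets the variance depend on the predicted mean; X and y_binary are assumed placeholders:

```python
import statsmodels.api as sm

X_design = sm.add_constant(X)        # add an intercept column to the predictors
# Binomial family with its default logit link: a logistic regression GLM.
glm_fit = sm.GLM(y_binary, X_design, family=sm.families.Binomial()).fit()
print(glm_fit.summary())
```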

Breusch Pagan Test

A p-value represents the probability of observing data at least as extreme as ours if the null hypothesis were true, and it is crucial in deciding whether to reject the null hypothesis. The Breusch-Pagan test is used to detect heteroskedasticity. The test runs an auxiliary regression in which the squared residuals from the initial regression model are predicted from the predictor variables, and then evaluates the significance of the resulting coefficients. If these coefficients deviate significantly from zero, it suggests the presence of heteroskedasticity. Based on the p-value, if we fail to reject the null hypothesis (H0), it implies that the data do not exhibit heteroskedasticity; if we instead favor the alternative hypothesis, it suggests the presence of heteroskedasticity. In practice, if the p-value of the test falls below a chosen significance threshold (e.g., α = .05), we reject the null hypothesis and infer that the regression model exhibits heteroskedasticity.
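
A sketch of the test using statsmodels' het_breuschpagan, which performs the auxiliary regression of squared residuals on the predictors internally; X and y are assumed placeholders for the original regression's predictors and response:

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit the initial regression whose residuals we want to check.
X_design = sm.add_constant(X)
ols_fit = sm.OLS(y, X_design).fit()

# Breusch-Pagan: regress squared residuals on the predictors and test the coefficients.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X_design)
# If lm_pvalue < 0.05, reject H0 of homoskedasticity (evidence of heteroskedasticity).
```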


Linear and Multiple Linear Regression

Linear regression is a statistical technique that involves estimating the value of one variable based on the values of other variables. In the context of our class discussion, we explored how to predict the percentage of diabetes based on the percentage of inactivity alone, represented as (%diabetes = α + β %inactivity + ε), where % diabetes is considered the dependent variable, and % inactivity serves as the independent variable. We can also extend this approach to a multiple linear regression method, which incorporates more than one independent variable, such as (%diabetes = α + β1 %inactivity + β2 %obesity + ε).

When conducting multiple linear regression to predict %diabetes using both %inactivity and %obesity as independent variables, it’s important to note that we have a limited dataset comprising only 354 data points. In this scenario, we need to construct a model that describes the relationship between %diabetes and these two independent variables based on this relatively small dataset.
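
A sketch of fitting this model with statsmodels; the file name and column names ("diabetes.csv", "% INACTIVE", "% OBESE", "% DIABETIC") are placeholders for the actual merged 354-row dataset:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("diabetes.csv")                    # hypothetical merged dataset

# %diabetes = alpha + beta1 * %inactivity + beta2 * %obesity + error
X = sm.add_constant(df[["% INACTIVE", "% OBESE"]])  # predictors plus intercept
y = df["% DIABETIC"]

fit = sm.OLS(y, X).fit()
print(fit.summary())                                # coefficients and their p-values
```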