6.6. Exercises#

These exercises cover basic ML strategies, including regression, interpolation, hyperparameter tuning, and dimensionality reduction.

NOTE: These exercises cover only a small sample of ML techniques. All of them use the scikit-learn Python package; documentation can be found at https://scikit-learn.org/stable/.

6.6.1. Exercise 1#

Attached to this exercise is an IR spectrum for ethanol. Use the pandas package to import the CSV file as a DataFrame.

# code here
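One possible starting point (a minimal sketch; the file name ethanol_IR.csv and the column names are assumptions, so adjust them to match the attached file):

import pandas as pd

# Read the attached spectrum into a DataFrame; the file name is a placeholder.
df = pd.read_csv("ethanol_IR.csv")
print(df.head())     # inspect the first few rows
print(df.columns)    # confirm the actual column names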

Fit the data using radial basis function (RBF) kernel regression. This is a non-parametric model that performs interpolation. Set the Gaussian width (sigma) to 100 and train the model on all data points. Report the mean absolute error (MAE) of the model and plot the data together with the fitted model on the same plot. Be sure to color data points used for model training differently from those that were only part of the test set.

# code here
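A minimal sketch of unregularized Gaussian (RBF) kernel regression, assuming the DataFrame columns are named wavenumber and absorbance (placeholders for whatever the file actually contains):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

x = df["wavenumber"].to_numpy().reshape(-1, 1)   # placeholder column name
y = df["absorbance"].to_numpy()                  # placeholder column name
sigma = 100.0                                    # Gaussian width

def rbf_kernel_matrix(xa, xb, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-(xa_i - xb_j)**2 / (2 * sigma**2))."""
    return np.exp(-((xa - xb.T) ** 2) / (2.0 * sigma ** 2))

# Train on all data points: solve K w = y for the kernel weights.
# lstsq is used because the kernel matrix can be nearly singular.
K = rbf_kernel_matrix(x, x, sigma)
w = np.linalg.lstsq(K, y, rcond=None)[0]

y_pred = K @ w
print("MAE:", mean_absolute_error(y, y_pred))

plt.plot(x, y_pred, label="kernel regression fit")
plt.scatter(x, y, s=5, color="k", label="training data")
plt.legend()
plt.show()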

Repeat the above task but only train the model on every third data point. What happens to the fit and to the MAE?

# code here
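A short sketch of the same workflow trained on every third point, reusing the helper function and placeholder column names from the previous sketch:

x_train, y_train = x[::3], y[::3]                # every third data point

K_train = rbf_kernel_matrix(x_train, x_train, sigma)
w = np.linalg.lstsq(K_train, y_train, rcond=None)[0]

# Predict on the full grid, including points never seen during training.
y_pred = rbf_kernel_matrix(x, x_train, sigma) @ w
print("MAE on all points:", mean_absolute_error(y, y_pred))

plt.plot(x, y_pred, label="fit (trained on every 3rd point)")
plt.scatter(x, y, s=5, color="gray", label="test-only points")
plt.scatter(x_train, y_train, s=10, color="k", label="training points")
plt.legend()
plt.show()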

Lastly, repeat the fitting with a Gaussian width (sigma) of 1. Explain what you observe.

# code here

6.6.2. Exercise 2#

This exercise is a continuation of the previous one and uses the same dataset.

First, use kernel ridge regression (KRR) to fit the IR data. KRR is simply kernel regression from Exercise 1 with an added L2 regularization term to control smoothness. How does this model compare to your model in Exercise 1? Test your model with a regularization strength (alpha) of 0.1 and 10. What happens as alpha goes to infinity? Why?

# code here
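A minimal sketch using sklearn's KernelRidge, where gamma = 1 / (2 * sigma**2) maps the Gaussian width from Exercise 1 onto sklearn's parameterization (x, y, and sigma as defined in Exercise 1):

from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error

for alpha in (0.1, 10):
    krr = KernelRidge(kernel="rbf", alpha=alpha, gamma=1.0 / (2 * sigma ** 2))
    krr.fit(x, y)
    y_pred = krr.predict(x)
    print(f"alpha = {alpha}: MAE = {mean_absolute_error(y, y_pred):.4f}")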

Next, try using L1 regularization (LASSO) in place of L2. Note that, unlike L2, L1 regularization will remove unnecessary features entirely. For this exercise, use sklearn's GridSearchCV tool with 5-fold cross-validation to automatically perform hyperparameter tuning. Test regularization strengths in the range alpha = [0.0001, 0.001, 0.01, 0.1, 1.0, 10]. Comment on what happens as alpha increases. Plot the best model with the original data.

# code here
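One way to set this up (a sketch that assumes the L1 model is fit on the same Gaussian kernel features as Exercise 1, so that LASSO can zero out unneeded kernel centers; adjust if a different design matrix is intended):

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

K = rbf_kernel_matrix(x, x, sigma)        # kernel features from Exercise 1

param_grid = {"alpha": [0.0001, 0.001, 0.01, 0.1, 1.0, 10]}
search = GridSearchCV(Lasso(max_iter=100000), param_grid,
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(K, y)

print("Best alpha:", search.best_params_["alpha"])
best = search.best_estimator_
print("Nonzero coefficients:", np.count_nonzero(best.coef_))

plt.plot(x, best.predict(K), label="best LASSO fit")
plt.scatter(x, y, s=5, color="k", label="data")
plt.legend()
plt.show()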

6.6.3. Exercise 3#

This exercise will explore Principal Component Analysis as an unsupervised tool for dimensionality reduction. Begin by importing the diabetes toy dataset from the sklearn package. This dataset relates ten input features to a quantitative measure of diabetes progression. Once imported, use feature scaling to normalize the features (standard scaler).

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

# code here
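A minimal sketch following the linked documentation:

from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X, y = data.data, data.target            # 10 input features, 1 target
feature_names = data.feature_names

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)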

Principal Component Analysis (PCA) constructs orthogonal linear combinations of the original features (the principal components) and orders them by the amount of variance they capture. Apply PCA to the diabetes feature set. Keep all 10 principal components and plot their covariance matrix. What do you notice and why?

# code here
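A sketch assuming X_scaled from the previous cell:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)

# np.cov expects variables in rows, hence the transpose.
cov = np.cov(X_pca.T)
plt.imshow(cov, cmap="viridis")
plt.colorbar(label="covariance")
plt.xlabel("principal component")
plt.ylabel("principal component")
plt.show()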

Use PCA to identify the minimum number of features needed to capture 90% of the data’s variance.

# code here
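A sketch using the cumulative explained variance from the PCA fit above:

cumvar = np.cumsum(pca.explained_variance_ratio_)
n_needed = np.argmax(cumvar >= 0.90) + 1
print("Components needed for 90% of the variance:", n_needed)

# Equivalently, PCA(n_components=0.90) keeps just enough components.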

Utilize the function below and modify the code to create a loadings plot with the first two principal components. Identify which features from the original dataset may be unnecessary. Features whose loadings are nearly identical behave the same way in the principal components and are therefore redundant.
HINT: Use the pca.components_ attribute.

## Custom-made plotting function for a loadings plot
import matplotlib.pyplot as plt

def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    # Optionally plot the (scaled) scores behind the loadings:
    # plt.scatter(xs * scalex, ys * scaley)
    for i in range(n):
        # One arrow per original feature, showing how it loads onto PC1 and PC2.
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1),
                     color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i],
                     color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.grid()

# Call the function. Use only the 2 PCs.
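For reference, one possible call, assuming X_pca, pca, and feature_names from the previous cells; the slicing keeps only the first two principal components:

myplot(X_pca[:, 0:2], pca.components_[0:2, :].T, labels=feature_names)
plt.show()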