8.7. Exercises
These exercises cover basic ML strategies, including classification and generative models.
NOTE: These exercises cover only a small sample of ML techniques. All models here are implemented with the scikit-learn Python package. Documentation can be found at https://scikit-learn.org/stable/.
8.7.1. Exercise 1
This exercise will use support vector machines (SVMs) to build a classification model for a 2-class dataset with a dimensionality of 2. Begin by creating a moons toy dataset with 200 samples and a noise of 0.1 using the sklearn.datasets.make_moons function: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html
Visualize the dataset and color each point according to its assigned class.
# code here
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
%matplotlib inline
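One possible way to create and visualize the dataset is sketched below. The random_state value, the colormap, and the axis labels are assumptions chosen only to make the figure reproducible and readable; they are not prescribed by the exercise.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# random_state=0 is an arbitrary choice so the dataset is reproducible
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Color each point by its assigned class (0 or 1)
fig, ax = plt.subplots(figsize=(5, 4))
scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.legend(*scatter.legend_elements(), title="Class")
```

X holds the two features for each of the 200 samples and y holds the class labels.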
Train a support vector machine with a radial basis function (RBF) kernel to create a discriminative classification model for this dataset. Tune the gamma and C parameters to achieve a minimum accuracy of 90% on your test set. Use an 80/20 train/test split.
Comment on what happens as C goes to 0 and as C goes to infinity.
Plot the data colored by the true class and show the SVM decision boundary. Calculate the accuracy, precision, and recall for your prediction.
# code here
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score
def create_and_train_model(X_data:tuple, y_data:tuple, random_state:int=42, C:float=1.0, kernel:str='rbf', gamma:float=1.0):
def get_scores(model, X_test, y_test):
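A minimal sketch of how the two function stubs above might be completed is shown below. The return convention of create_and_train_model (model plus held-out test data), the dictionary returned by get_scores, and the values C=10 and gamma=2 are assumptions made for illustration, not the reference solution.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def create_and_train_model(X_data, y_data, random_state=42, C=1.0,
                           kernel="rbf", gamma=1.0):
    """Split 80/20, fit an SVC on the training part, and return the model
    with the held-out test data (an assumed return convention)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.2, random_state=random_state
    )
    model = SVC(C=C, kernel=kernel, gamma=gamma).fit(X_train, y_train)
    return model, X_test, y_test


def get_scores(model, X_test, y_test):
    """Return accuracy, precision, and recall on the test set."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }


X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
model, X_test, y_test = create_and_train_model(X, y, C=10.0, gamma=2.0)
scores = get_scores(model, X_test, y_test)
print(scores)

# Predict on a dense grid to draw the decision regions
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200),
)
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(5, 4))
ax.contourf(xx, yy, Z, alpha=0.2, cmap="coolwarm")  # shaded class regions
ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")  # true classes
ax.set_title(f"RBF SVM, test accuracy = {scores['accuracy']:.2f}")
```

The boundary between the two shaded regions is the decision boundary. Larger gamma values make it follow individual points more closely.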
Response
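As C goes to 0, misclassified training points are barely penalized, so the margin widens, the model underfits, and accuracy diminishes. As C goes to infinity, the model approaches a hard-margin classifier that penalizes every training error heavily; for this dataset the accuracy improves and then remains the same, although very large C can overfit noisier data.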
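The trend in C can be checked empirically with a short sweep such as the one below. The grid of C values, gamma=1, and the random seeds are assumptions for illustration. The number of support vectors is printed alongside the scores because it shows how strongly the margin is regularized.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for C in [1e-4, 1e-2, 1.0, 1e2, 1e4]:
    model = SVC(C=C, kernel="rbf", gamma=1.0).fit(X_train, y_train)
    results[C] = {
        "train": model.score(X_train, y_train),  # accuracy on training data
        "test": model.score(X_test, y_test),     # accuracy on held-out data
        "n_sv": int(model.n_support_.sum()),     # total support vectors
    }
    print(f"C={C:g}: {results[C]}")
```

With very small C nearly every training point ends up as a support vector, while large C leaves only the points closest to the boundary.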
Show the confusion matrix for your best-performing model.
# code here
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
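A sketch of the confusion matrix step follows. The hyperparameters C=10 and gamma=2 stand in for whatever your best-performing model turns out to be, and the colormap is an arbitrary choice.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = SVC(C=10.0, kernel="rbf", gamma=2.0).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot(cmap="Blues")
```

The diagonal entries count correct predictions. Off-diagonal entries are the false positives and false negatives that enter the precision and recall scores.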
8.7.2. Exercise 2
This exercise is a continuation of the previous and uses the same dataset.
Create a random 80/20 train/test split for the moons dataset from above. For k in [2,5,10,20], use a k-nearest neighbors (kNN) model for classification. Use subplots to show the result with each point colored according to its predicted class for each value of k. Compute the accuracy for each case.
Lastly, discuss what is happening when the fit method is called. Does kNN have a loss function? How are classes assigned? Is this supervised or unsupervised ML?
# code here
from sklearn.neighbors import KNeighborsClassifier
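The loop over k could be organized as in the sketch below. The random seeds, figure size, and colormap are assumptions. Test points are plotted here, but plotting the full dataset would work equally well.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharex=True, sharey=True)
accuracies = {}
for ax, k in zip(axes, [2, 5, 10, 20]):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracies[k] = accuracy_score(y_test, y_pred)
    # Color each test point by its predicted class
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap="coolwarm", edgecolor="k")
    ax.set_title(f"k = {k}, accuracy = {accuracies[k]:.2f}")

print(accuracies)
```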
Response
When fit is called, kNN simply stores the training data (optionally organized in a search structure such as a k-d tree or ball tree); no parameters are optimized. At prediction time, classes are assigned by vote among the k nearest neighbors in the “training” set, for which class assignment is known. A simple majority wins, and ties can be dealt with by decreasing k until the tie is broken. There is no loss function, but this still counts as supervised ML since the target labels are known (and used) in model “fitting.”
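The voting picture can be verified directly. The sketch below looks up the k nearest training points for each test point, takes a majority vote by hand, and compares the result with the classifier's own predictions. The choice k=5 (odd, so no ties occur with two classes) and the seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

k = 5  # odd k avoids ties in a two-class problem
knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

# Indices of the k nearest training points for every test point
neighbor_idx = knn.kneighbors(X_test, return_distance=False)
votes = y_train[neighbor_idx]  # shape (n_test, k), entries are 0 or 1

# Class 1 wins when more than half of the neighbors carry label 1
manual_pred = (votes.sum(axis=1) > k / 2).astype(int)
print(np.array_equal(manual_pred, knn.predict(X_test)))
```

The labels y_train enter only through the vote, which is why the method is supervised even though nothing is optimized.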
8.7.3. Exercise 3
This exercise will focus on multi-class classification. Use the sklearn make_blobs function to generate a toy dataset with 150 samples, 4 clusters, and a dimensionality of 10.
# code here
from sklearn.datasets import make_blobs
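A minimal sketch of the dataset creation is given below. The random_state is an assumption for reproducibility, and the cluster spread is left at the make_blobs default.

```python
import numpy as np
from sklearn.datasets import make_blobs

# 150 samples, 4 clusters (classes), 10 features
X, y = make_blobs(n_samples=150, centers=4, n_features=10, random_state=0)

print(X.shape)
print(np.bincount(y))  # number of samples in each class
```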
Next, use PCA to reduce the dimensionality to 2. Plot the dataset with each point colored according to its class.
# code here
from sklearn.decomposition import PCA
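The reduction step might look like the following sketch. The seed and plotting choices are assumptions. Printing the explained variance ratio gives a quick check of how much information the two retained components keep.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, y = make_blobs(n_samples=150, centers=4, n_features=10, random_state=0)

# Project the 10-dimensional data onto its first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

fig, ax = plt.subplots(figsize=(5, 4))
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", edgecolor="k")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend(*scatter.legend_elements(), title="Class")
```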
Train a decision tree (discriminative) model on the reduced dataset and plot the result with points colored according to their predicted class. Be sure to perform hyperparameter tuning as necessary; you can decide the extent to which this is necessary. Compute the accuracy, precision, and recall scores.
# code here
from sklearn.tree import DecisionTreeClassifier
# Function that trains a decision tree classifier
def train_tree(X_train, y_train, max_depth):
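One way the train_tree stub could be filled in and tuned is sketched below. Tuning only max_depth with 5-fold cross-validation, the depth range 1 to 6, the stratified split, and the macro averaging of precision and recall are all assumptions. Some averaging choice is required because this is a multi-class problem.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_tree(X_train, y_train, max_depth):
    """Fit a decision tree classifier with the given maximum depth."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    return tree.fit(X_train, y_train)


X, y = make_blobs(n_samples=150, centers=4, n_features=10, random_state=0)
X_2d = PCA(n_components=2).fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_2d, y, test_size=0.2, stratify=y, random_state=0
)

# Pick the depth with the best mean cross-validated accuracy
cv_scores = {
    depth: cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X_train, y_train, cv=5,
    ).mean()
    for depth in range(1, 7)
}
best_depth = max(cv_scores, key=cv_scores.get)

tree = train_tree(X_train, y_train, best_depth)
y_pred = tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
print(best_depth, accuracy, precision, recall)
```

Coloring a scatter plot of X_test by y_pred then shows the predicted classes.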
Repeat the above task with a Gaussian naive Bayes model (generative).
# code here
from sklearn.naive_bayes import GaussianNB
# Function that will train a Gaussian Naive Bayes classifier
def train_gnb(X_train, y_train):
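A comparable sketch for the Gaussian naive Bayes case follows, with the same assumed split and macro-averaged metrics. The fitted class means and priors are printed because they are the learned parameters of the per-class Gaussians, which is what makes the model generative.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def train_gnb(X_train, y_train):
    """Fit a Gaussian naive Bayes classifier."""
    return GaussianNB().fit(X_train, y_train)


X, y = make_blobs(n_samples=150, centers=4, n_features=10, random_state=0)
X_2d = PCA(n_components=2).fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_2d, y, test_size=0.2, stratify=y, random_state=0
)

gnb = train_gnb(X_train, y_train)
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
print(accuracy, precision, recall)

print(gnb.theta_)        # per-class feature means, shape (4, 2)
print(gnb.class_prior_)  # estimated class probabilities
```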
8.7.4. Exercise 4
This exercise focuses on unsupervised generative models, which can be used to generate new data based on patterns learned from existing data. Begin by importing the digits dataset from sklearn. It is a small, low-resolution dataset in the style of MNIST (hence the mnist variable names used here), but it is not the MNIST dataset itself. This is a high-dimensionality dataset consisting of hand-drawn digits (0–9) of the kind commonly used for training image processing systems. It consists of 1,797 samples, each corresponding to one of the ten digits. Each sample can be represented by an 8 x 8 grid of pixels, where each pixel value gives the grayscale intensity of the hand-drawn image at that location. Thus, the dimensionality is 64.
# Import digits dataset
from sklearn.datasets import load_digits
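Loading the data could look like the sketch below. The name y_mnist follows the variable mentioned in this exercise. X_mnist and X_six are assumed names introduced here for illustration.

```python
from sklearn.datasets import load_digits

digits = load_digits()
X_mnist, y_mnist = digits.data, digits.target  # flattened images and labels

print(X_mnist.shape)                 # samples x 64 pixel features
print(digits.images.shape)           # the same data as 8 x 8 images
print(X_mnist.min(), X_mnist.max())  # range of pixel intensities

# Boolean mask selecting only the samples labeled 6
X_six = X_mnist[y_mnist == 6]
print(X_six.shape)
```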
Create a Gaussian mixture model (GMM) to create three new examples of the digit 6. You can use the y_mnist variable to quickly select this subset (y_mnist == 6). Show each of your predictions. Use spherical covariance and toy with the number of mixture components to create a model that can generate reasonably convincing synthetic data.
# Import Gaussian Mixture Model library
from sklearn.mixture import GaussianMixture
# Function to create and train a GMM with spherical covariance and variable number of mixture components
def train_GMM(X, n_components):
# Function that will plot the 64-dimensional data as an 8x8 image
def plot_image(data, ax=None):
    if ax is None:
        fig, ax = plt.subplots()
    ax.imshow(data.reshape(8, 8), cmap='gray')
    ax.axis('off')
    return ax
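A sketch of the GMM step is given below. The candidate component counts, the final choice of 10 components, the use of BIC as a rough guide, and the clipping of samples to the 0 to 16 pixel range are assumptions for illustration rather than a prescribed procedure.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.mixture import GaussianMixture


def train_GMM(X, n_components):
    """Fit a GMM with spherical covariance and n_components mixture components."""
    gmm = GaussianMixture(
        n_components=n_components, covariance_type="spherical", random_state=0
    )
    return gmm.fit(X)


def plot_image(data, ax=None):
    """Show a 64-dimensional sample as an 8x8 grayscale image."""
    if ax is None:
        fig, ax = plt.subplots()
    ax.imshow(data.reshape(8, 8), cmap="gray")
    ax.axis("off")
    return ax


digits = load_digits()
X_mnist, y_mnist = digits.data, digits.target
X_six = X_mnist[y_mnist == 6]

# Compare a few component counts with the BIC (lower is better)
for n in [1, 5, 10, 20]:
    print(n, round(train_GMM(X_six, n).bic(X_six)))

gmm = train_GMM(X_six, n_components=10)
samples, _ = gmm.sample(3)         # sample() also returns component labels
samples = np.clip(samples, 0, 16)  # keep pixel values in the valid range

fig, axes = plt.subplots(1, 3, figsize=(6, 2))
for ax, sample in zip(axes, samples):
    plot_image(sample, ax=ax)
```

With spherical covariance every pixel in a component shares one variance, so samples tend to look like a component mean with uniform speckle added.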
Create a kernel density estimation (KDE) model to create three new examples of the digit 6. Show each of your predictions. Use a Gaussian kernel and toy with the bandwidth to create a model that can generate reasonably convincing synthetic data. Compare the KDE model to the GMM model.
# Import kernel density estimation from sklearn
from sklearn.neighbors import KernelDensity
# Function to train a KDE model
def train_kde_model(data, bandwidth=0.2, kernel='gaussian'):
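The KDE step might be approached as in the sketch below. Selecting the bandwidth by cross-validated log-likelihood with GridSearchCV, the logarithmic bandwidth grid, and the clipping to the 0 to 16 pixel range are assumptions. Tuning the bandwidth by eye, as the exercise suggests, is equally valid.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def train_kde_model(data, bandwidth=0.2, kernel="gaussian"):
    """Fit a kernel density estimate to the data."""
    return KernelDensity(bandwidth=bandwidth, kernel=kernel).fit(data)


digits = load_digits()
X_six = digits.data[digits.target == 6]

# Score each bandwidth by held-out log-likelihood (KernelDensity.score)
bandwidths = np.logspace(-1, 1, 10)
search = GridSearchCV(
    KernelDensity(kernel="gaussian"), {"bandwidth": bandwidths}, cv=5
)
search.fit(X_six)
best_bandwidth = search.best_params_["bandwidth"]
print(best_bandwidth)

kde = train_kde_model(X_six, bandwidth=best_bandwidth)
new_sixes = np.clip(kde.sample(3, random_state=0), 0, 16)
print(new_sixes.shape)
```

Each KDE sample is a randomly chosen training image plus Gaussian noise scaled by the bandwidth. A very small bandwidth therefore reproduces the training digits almost exactly, whereas a GMM blends many images into each component mean.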
Create three synthetic examples of the digit 2 using a generative model of your choice.
# code here
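As one possible choice of generative model, the sketch below fits a full-covariance GMM in a PCA-reduced space and maps the samples back to pixel space. The 10 principal components, 3 mixture components, seeds, and clipping are assumptions. A spherical GMM or a KDE as in the earlier parts would also satisfy the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

digits = load_digits()
X_two = digits.data[digits.target == 2]

# Reduce to 10 dimensions so a full covariance matrix can be estimated reliably
pca = PCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X_two)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X_reduced)

# Sample in the reduced space, then map back to the 64 pixel dimensions
samples_reduced, _ = gmm.sample(3)
new_twos = np.clip(pca.inverse_transform(samples_reduced), 0, 16)

fig, axes = plt.subplots(1, 3, figsize=(6, 2))
for ax, sample in zip(axes, new_twos):
    ax.imshow(sample.reshape(8, 8), cmap="gray")
    ax.axis("off")
```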