import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

Machine Learning
Source: CDC Heart Disease Facts
Examples include assigning a given email to the “spam” or “non-spam” class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.).
In this session we will see a very simple first example of executing this plan, to get a first impression without getting tangled in the details. But our focus, as indicated, will be on the first two steps of the process. Then, in later sessions, we will look into those details as we learn more about Machine Learning (ML) models and the whole ML process.
As we gain experience we will see that this plan is in fact a draft, and we will begin to analyze more elaborate strategies in ML.
A data table for the heart disease problem could be something like:
We now deal with binary classification and will return to the multiclass problem later. For binary classification, many of the models we will study use a soft partition or probabilistic approach: rather than assigning a class directly, they estimate the probability \(P(y = 1\mid \mathbf x)\) and then apply a threshold to it.
Let \(\Omega\) be our input space, so that each data point (sample) \((\mathbf x, y)\) is given by a point \(\mathbf x\) in \(\Omega\) and an output value \(y\in\{0, 1\}\).
Note: In each particular problem the output \(y\) can only take two different values: (yes, no), (true, false), (dog, cat) and so on. But sometimes it is useful to identify these values with (1, 0).
Some samples of class no may get a prediction of class yes (false positives). We will return to this shortly, when discussing how to assess the quality of a classification model.

The second step in our plan to solve the classification problem is itself a multistep process:
We will next see a first basic example of how to perform these steps using Python.
Simdata0.csv. This file contains a synthetic dataset with several input variables and one categorical output variable Y. Always begin by opening every new data file in a text editor. Look for the presence of a header line, the separator used and any other relevant characteristics of the data file.
We will usually begin by importing some of the standard Python Data Science libraries, as in:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

Then load the data and see the first lines of the table: run the code below (it may need some modification).
df = pd.read_csv('Simdata0.csv')
df.head(4)

|   | id | X1 | X2 | X3 | X4 | Y |
|---|---|---|---|---|---|---|
| 0 | 0 | -1.055092 | 2.079289 | 2.869735 | B | 1 |
| 1 | 1 | -0.219354 | 3.260428 | 2.025598 | B | 1 |
| 2 | 2 | 2.718957 | 1.709043 | -1.027776 | A | 1 |
| 3 | 3 | 0.143881 | 4.281243 | 1.331685 | B | 0 |
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1000 non-null int64
1 X1 997 non-null float64
2 X2 998 non-null float64
3 X3 1000 non-null float64
4 X4 1000 non-null object
5 Y 1000 non-null int64
dtypes: float64(3), int64(2), object(1)
memory usage: 47.0+ KB
There seem to be missing values in X1 and X2.
We should always check the first few rows of the table to get a first impression of the data. Use the pandas method head for this:
df.head()

|   | id | X1 | X2 | X3 | X4 | Y |
|---|---|---|---|---|---|---|
| 0 | 0 | -1.055092 | 2.079289 | 2.869735 | B | 1 |
| 1 | 1 | -0.219354 | 3.260428 | 2.025598 | B | 1 |
| 2 | 2 | 2.718957 | 1.709043 | -1.027776 | A | 1 |
| 3 | 3 | 0.143881 | 4.281243 | 1.331685 | B | 0 |
| 4 | 4 | -0.191721 | 2.405672 | 2.214553 | B | 1 |
The id column merely duplicates the row index and carries no information, so we drop it.

df.drop(columns="id", inplace=True)

Always check the result:
df.head()

|   | X1 | X2 | X3 | X4 | Y |
|---|---|---|---|---|---|
| 0 | -1.055092 | 2.079289 | 2.869735 | B | 1 |
| 1 | -0.219354 | 3.260428 | 2.025598 | B | 1 |
| 2 | 2.718957 | 1.709043 | -1.027776 | A | 1 |
| 3 | 0.143881 | 4.281243 | 1.331685 | B | 0 |
| 4 | -0.191721 | 2.405672 | 2.214553 | B | 1 |
Next we count the number of distinct values per column; this helps to identify which variables are categorical.

df.nunique()

X1 997
X2 995
X3 1000
X4 2
Y 2
dtype: int64
Since X4 and Y take only two distinct values each, we treat them as categorical variables:

output = 'Y'
cat_inputs = ['X4']
df[cat_inputs + [output]] = df[cat_inputs + [output]].astype("category")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X1 997 non-null float64
1 X2 998 non-null float64
2 X3 1000 non-null float64
3 X4 1000 non-null category
4 Y 1000 non-null category
dtypes: category(2), float64(3)
memory usage: 25.8 KB
We also keep lists with the names of all the inputs and, separately, of the numeric inputs:

inputs = df.columns.drop(output)
num_inputs = inputs.difference(cat_inputs).tolist()
print(inputs)

Index(['X1', 'X2', 'X3', 'X4'], dtype='object')
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)
ohe_result = ohe.fit_transform(df[cat_inputs])

# Build the names of the new columns: one column per category of each categorical input
ohe_inputs = [ohe.categories_[k].tolist() for k in range(len(cat_inputs))]
ohe_inputs = [cat_inputs[k] + "_" + str(a) for k in range(len(cat_inputs)) for a in ohe_inputs[k]]

ohe_df = pd.DataFrame(ohe_result, columns=ohe_inputs)
df[ohe_inputs] = ohe_df

Note that the lists num_inputs and cat_inputs and the output are unaffected by this.
We can do a quick check of the number of missing values per column, to confirm what we saw before:
df.isnull().sum(axis=0)

X1 3
X2 2
X3 0
X4 0
Y 0
X4_A 0
X4_B 0
dtype: int64
Dealing with missing values is usually non-trivial. After identifying them we can either drop them or try to replace them with some meaningful values (imputation). We will discuss that later in the course. For now we just drop the table rows containing missing values. Note the inplace argument, and check the result again with the preceding command after removing the missing data:
df.dropna(inplace=True)
df.isnull().sum(axis=0)

X1 0
X2 0
X3 0
X4 0
Y 0
X4_A 0
X4_B 0
dtype: int64
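As an aside, and only as a hedged sketch of the imputation alternative mentioned above (not applied in this session's workflow), scikit-learn's SimpleImputer could be used to fill each missing numeric value with, say, the column median:

from sklearn.impute import SimpleImputer

# Hypothetical alternative to dropping rows (not used here): impute numeric columns with the median
imputer = SimpleImputer(strategy="median")
df_imputed = df.copy()
df_imputed[num_inputs] = imputer.fit_transform(df_imputed[num_inputs])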
This step is a crucial part of the method that we use to assess model performance. We will have plenty of opportunities to go into details during the course, but the central idea is that we need to keep a part of the data (the test set) hidden from the model, so that we can use it at the end to get a better measure of the model’s performance. We must do this in the early steps of preprocessing for this strategy to work.
X = df.drop(columns=output)
Y = df[output]

The actual splitting is now accomplished with:
from sklearn.model_selection import train_test_split
XTR, XTS, YTR, YTS = train_test_split(X, Y,
                                      test_size=0.2,    # fraction of data reserved for testing
                                      random_state=1,   # seed for replication
                                      stratify=Y)       # preserves the distribution of Y

The stratify=Y option guarantees that the proportion of both output classes is (approximately) the same in the training and test datasets.
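A quick check (our own addition) confirms that the class proportions are indeed preserved:

# Compare the class proportions in the full data, the training set and the test set
print(Y.value_counts(normalize=True))
print(YTR.value_counts(normalize=True))
print(YTS.value_counts(normalize=True))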
Outliers (also called atypical values) can be generally defined as samples that are exceptionally far from the mainstream of the data. Formally, a value \(x\) of a numeric variable \(X\) is considered an outlier if it is greater than the upper outlier limit \[q_{0.75}(X) + 1.5\operatorname{IQR}(X)\] or if it is smaller than the lower outlier limit \[q_{0.25}(X) - 1.5\operatorname{IQR}(X),\] where \(q_{0.25}(X), q_{0.75}(X)\) are respectively the first and third quartiles of \(X\) and \(\operatorname{IQR}(X) = q_{0.75}(X) - q_{0.25}(X)\) is the interquartile range of \(X\).
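As a quick illustration (a sketch of our own, using pandas quantiles on the training inputs), these limits can be computed directly:

# Compute the outlier limits for each numeric input of the training set
q1 = XTR[num_inputs].quantile(0.25)
q3 = XTR[num_inputs].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
# Count how many training samples fall outside the limits for each variable
((XTR[num_inputs] < lower_limit) | (XTR[num_inputs] > upper_limit)).sum()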
A boxplot of the variable is often the easiest way to spot the presence of outliers.
When one or more samples are suspected to be outliers, the first step is to make sure that the values are valid and that no data recording errors have occurred.
With small sample sizes, apparent outliers might be a result of a skewed distribution where there are not yet enough data to see the skewness.
Finally, there are true, informative outliers.
Just as we saw with missing data, dealing with outliers is not trivial (even when we are able to separate apparent from real outliers). Sometimes we will have to check their influence on the model before deciding how to proceed.
In simple cases, where the outliers represent a very small fraction of the training data, we will simply remove them. But keep in mind that this is not always a good idea. In fact, the removal of all outliers often leads to the appearance of new outliers.
XTR_numeric_boxplots = XTR[num_inputs].plot.box(subplots=True,
                                                layout=(1, len(num_inputs)),
                                                sharex=False, sharey=False, figsize=(6, 3))

The plot reveals some clearly atypical values in X1 and X3, while some other outliers closer to the boxes can be considered as simply belonging to the tails of the distributions of those variables.
from matplotlib.cbook import boxplot_stats

def explore_outliers(df, num_vars):
    # Draw one boxplot per numeric variable and collect information about its outliers
    fig, axes = plt.subplots(nrows=len(num_vars), ncols=1, figsize=(7, 3), sharey=False, sharex=False)
    fig.tight_layout()
    outliers_df = dict()
    for k in range(len(num_vars)):
        var = num_vars[k]
        sns.boxplot(df, x=var, ax=axes[k])
        # boxplot_stats returns, among other things, the outlying points ("fliers")
        outliers_df[var] = boxplot_stats(df[var])[0]["fliers"]
        out_pos = np.where(df[var].isin(outliers_df[var]))[0].tolist()
        out_idx = [df[var].index.tolist()[p] for p in out_pos]
        outliers_df[var] = {"values": outliers_df[var],
                            "positions": out_pos,
                            "indices": out_idx}
    return outliers_df

# %load "../exclude/code/2_1_Exercise_003.py"

In this step we will look into the distribution of each of the input variables and, at the same time, we will study the possible relations between them. This includes relations between the input variables themselves and also the relation of the inputs with the output \(Y\). As a result of this Exploratory Data Analysis (EDA) we will decide whether we need to apply a transformation or to create new input variables and, where appropriate, whether to select a smaller set of inputs. This last part belongs to the set of techniques called Feature Selection and Engineering.
In this section we will discuss several topics related to this kind of analysis:

- EDA
- the describe function
- Feature Selection and Engineering

We will also see that Scikit-learn provides a pipeline mechanism to ease the implementation of Feature Engineering and Modeling.

The describe function
This pandas function provides numerical summaries for the variables in a dataframe. If applied directly the output only considers numerical variables. Thus, it is best to deal with numerical and factor inputs separately.
For numeric inputs (we transpose the resulting table because we have a small number of variables) we get standard statistical summaries:
XTR[num_inputs].describe().transpose()

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| X1 | 783.0 | 0.959919 | 0.926650 | -1.395655 | 0.319030 | 0.961506 | 1.569518 | 3.401924 |
| X2 | 783.0 | 3.157209 | 1.965182 | -2.043726 | 1.857064 | 3.157723 | 4.550626 | 8.355793 |
| X3 | 783.0 | 1.038966 | 0.958495 | -1.349988 | 0.421874 | 1.049817 | 1.668060 | 3.490106 |
For input factors we obtain basic information (number of distinct values, the mode and its frequency):
XTR[cat_inputs].describe().transpose()

|   | count | unique | top | freq |
|---|---|---|---|---|
| X4 | 796 | 2 | B | 414 |
A pairplot for a set of variables \(V_1, \ldots, V_j\) is a set of plots arranged in a rectangular (or triangular) grid, describing the distribution of and relations between those variables. Here we use the seaborn pairplot function to draw a pairplot for the numerical inputs.
sns.set_style("white")
plt_df = XTR[num_inputs].copy()
plt_df["YTR"] = YTR
sns.pairplot(plt_df, hue="YTR", corner=True, height=2)

The corner argument removes the (redundant) plots above the diagonal. The hue argument includes the information about the output classes in the plots (different colors for each class).

In our example the pairplot shows these remarkable features of the dataset:
The following code will plot parallel boxplots for each combination of a numerical input and a factor input. This information is useful for feature selection and engineering.
for numVar in num_inputs:
    if len(cat_inputs) > 1:
        fig, axes = plt.subplots(1, len(cat_inputs))  # create figure and axes
        for col, ax in zip(cat_inputs, axes):  # one boxplot for each factor input
            print(f"Analyzing the relation between factor input {col} and", numVar)
            sns.boxplot(data=XTR, x=col, y=numVar, ax=ax)
        # set subplot margins
        plt.subplots_adjust(left=0.9, bottom=0.4, right=2, top=1, wspace=1, hspace=1)
        plt.show()
    else:
        print(f"Analyzing the relation between factor input {cat_inputs[0]} and", numVar)
        sns.boxplot(data=XTR, x=cat_inputs[0], y=numVar)
        plt.show()

The correlation matrix completes the information provided by the pairplot. The values in its diagonal are of course all 1.
XTR[num_inputs].corr()

|   | X1 | X2 | X3 |
|---|---|---|---|
| X1 | 1.000000 | 0.026972 | -0.964959 |
| X2 | 0.026972 | 1.000000 | -0.020874 |
| X3 | -0.964959 | -0.020874 | 1.000000 |
Direct inspection of the correlation matrix is often not good enough for datasets with many inputs. The code below plots a version of the correlation matrix that uses a color scale to help locate interesting correlations.
plt.figure()
plt.matshow(XTR[num_inputs].corr(), cmap='viridis')
plt.xticks(range(len(num_inputs)), num_inputs, fontsize=14, rotation=45)
plt.yticks(range(len(num_inputs)), num_inputs, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)
plt.show()
The correlation matrix complements the information provided by the pairplot, and in the case of very high correlation values we may decide to drop one of the variables. The rationale behind this is that two highly correlated variables contain essentially the same information. Besides, some models suffer numerical stability issues when fed with highly correlated (collinear) variables.
In our case it confirms the very strong (negative) correlation between \(X_1\) and \(X_3\). Thus we will drop, e.g., \(X_3\):
XTR.drop(columns="X3", inplace=True)
num_inputs.remove("X3")  # keep the list of numeric inputs updated

There is no universally accepted threshold above which a correlation is considered high enough to act on. EDA can be our guide in this, and later in the course we will discuss methods for systematic feature selection. Meanwhile, and in simple cases such as our example, this manual approach is enough. But in more complex situations we can set a correlation threshold and use the following heuristic:
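The steps of that heuristic are not reproduced here; purely as an illustration, one common variant (an assumption on our part, not necessarily the one intended) drops one variable from every pair of numeric inputs whose absolute correlation exceeds the chosen threshold:

# Hypothetical sketch of a correlation-threshold heuristic (one common variant)
def drop_highly_correlated(data, threshold=0.9):
    corr = data.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold and cols[i] not in to_drop:
                to_drop.add(cols[j])  # keep the first variable of the pair, drop the second
    return data.drop(columns=sorted(to_drop)), sorted(to_drop)

# Example usage (the threshold value is arbitrary):
# reduced, dropped = drop_highly_correlated(XTR[num_inputs], threshold=0.9)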
Later in the course we will discuss PCA (principal component analysis) and other dimensionality reduction techniques that can be used to perform feature selection even for hundreds or thousands of inputs.
Many Machine Learning algorithms are affected by the scale of the predictors or require them to be (at least approximately) normal. Therefore we will sometimes apply transformations to the data. A typical example is standardization, where an input variable \(X\) is transformed as: \[
X^* = \dfrac{X -\bar X}{s_x}
\] where \(\bar X\) is the mean of \(X\) and \(s_x\) its (sample) standard deviation. Similar transformations are used to ensure that all values belong to the \([0, 1]\) interval, etc.
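As a quick illustration (our own sketch using pandas; later we will let scikit-learn's StandardScaler do this inside a pipeline):

# Standardize the numeric training inputs directly with pandas
XTR_std = (XTR[num_inputs] - XTR[num_inputs].mean()) / XTR[num_inputs].std()
XTR_std.describe().loc[["mean", "std"]]  # means are (numerically) 0 and standard deviations 1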
This uniparametric family of transformations (the Box-Cox family) is used to bring the distribution of a variable closer to normality. It is defined by: \[ X^* = \begin{cases} \dfrac{X^{\lambda} - 1}{\lambda} & \text{ if }\lambda\neq 0 \\[3mm] \log(X) & \text{ if }\lambda = 0 \end{cases} \]
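A minimal sketch (our illustration, not part of this session's workflow) using scikit-learn's PowerTransformer. Note that the Box-Cox transform requires strictly positive values, so the variable is shifted here purely for the sake of the example:

from sklearn.preprocessing import PowerTransformer

# Box-Cox needs strictly positive data; the shift below is only for this illustration
pt = PowerTransformer(method="box-cox")
X1_positive = XTR[["X1"]] - XTR["X1"].min() + 0.001
X1_transformed = pt.fit_transform(X1_positive)
print("estimated lambda:", pt.lambdas_[0])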
Sometimes adding new variables increases a model’s performance. The scatterplot for \(X_1\) and \(X_2\) showed a clear non-linear boundary between the output classes, like a parabola. A model with non-linear functions of \(X_1\) as inputs, such as \(X_1^2\), can do a better job at finding this boundary. We will explore this in future sessions, along with the concept of interaction between inputs. And to account for interactions we will also add new variables to the data, computed from the original ones.
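A minimal sketch of this idea (our illustration, applied to a copy of the training inputs so the working data are unchanged; scikit-learn's PolynomialFeatures offers a more systematic way to generate such terms):

# Add the square of X1 as a new input, working on a copy
XTR_extended = XTR.copy()
XTR_extended["X1_sq"] = XTR_extended["X1"] ** 2
XTR_extended[["X1", "X1_sq"]].head()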
We are approaching the end of preprocessing after finishing graphical EDA. We kept the original categorical inputs for this, but we should now drop them.
XTR.drop(columns=cat_inputs, inplace=True)

Many preprocessing steps (e.g. variable transformations or the addition of new variables) benefit from the pipeline framework in scikit-learn. This also holds for the subject of the next paragraph: the sampling methods used to deal with imbalance. So we defer the discussion of pipelines until we have introduced imbalance.
A classification problem is called imbalanced when there is a significant difference between the proportions of the output variable classes. In such cases we use the terminology majority class (or classes, in multiclass problems) and minority class (or classes) to distinguish between classes that take up a bigger proportion of the dataset and those that do not.
Severe imbalance, say 90% of the samples in the majority class, is a challenge for classification algorithms because the model can take the easy way out of always predicting the majority class. Its predictions will then be correct in approximately 90% of the cases, for free!
There are several ways in which this issue can be addressed. The common ones are:
We look at the relative frequencies of the classes of the output variable in the training set.
YTR.value_counts(normalize=True)

1 0.651341
0 0.348659
Name: Y, dtype: float64
With a 65% vs 35% proportion this dataset is certainly not balanced. But this is an instance of what is considered mild imbalance. Therefore, we will not deal with imbalance right now for this particular example.
We will later see that pipelines are a general tool that can be used to describe much of the modeling process. But to begin working with them we will first focus on their use for preprocessing. Essentially, a pipeline is a description of the steps we want Python to follow when processing the data.
For our first example we will consider a simple pipeline with just one step, in which the numeric variables are standardized. First we do the required imports.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

And now we provide a description of the standardization step and give it a name:
num_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

We want to apply this transformer only to the numeric variables, leaving the (one-hot encoded) categorical features unchanged. Let us see how we do that.
ColumnTransformer
This is where ColumnTransformer comes into play: it can be used to combine transformers that affect separate sets of columns into a single global transformer. In this case we want to use our num_transformer for the variables in num_inputs. And to indicate that the remaining categorical inputs (those in ohe_inputs) are not transformed at all we use the string "passthrough" that ColumnTransformer reads as ‘do nothing’.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, num_inputs),
        ("cat", "passthrough", ohe_inputs),
    ]
)

With the transformer ready we use it to make a pipeline with a single step, named prep (as in preprocessing). We have selected pandas dataframes as the output format for easier visualization.
preproc_pipe = Pipeline(steps=[('prep', preprocessor)])
_ = preproc_pipe.set_output(transform="pandas")

To apply this pipeline to the data we have to fit it and then transform the data with it. You may think of the fit part as computing the means and standard deviations needed for the standardization, and of the transform part as actually applying the transformation to the data. They can be done separately, but for convenience there is a method called fit_transform, which we now apply to XTR. Let us see what the output looks like:
preproc_XTR = preproc_pipe.fit_transform(XTR)
preproc_XTR.head()

|   | num__X1 | num__X2 | cat__X4_A | cat__X4_B |
|---|---|---|---|---|
| 330 | -1.978462 | 0.368036 | 0.0 | 1.0 |
| 461 | 0.729319 | -0.344977 | 1.0 | 0.0 |
| 945 | 0.161296 | 0.344238 | 1.0 | 0.0 |
| 248 | 0.779554 | -0.052410 | 1.0 | 0.0 |
| 398 | -0.381772 | -0.837718 | 0.0 | 1.0 |
Note that the names of the variables in the transformed data set reflect their path through the pipeline.
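As mentioned above, the fit and transform steps can also be run separately; a minimal equivalent sketch:

# Equivalent two-step version: fit the pipeline on the training data, then transform it
preproc_pipe.fit(XTR)
preproc_XTR_two_steps = preproc_pipe.transform(XTR)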
Now that we have seen how to transform the training set it is only natural to apply the preprocessing to the test set as well. This is straightforward, but note that we use transform and not fit_transform: the preprocessing parameters (the means and standard deviations of the scaler) must come from the training set only, so that no information from the test set leaks into the model.

preproc_XTS = preproc_pipe.transform(XTS)
preproc_XTS.head()

|   | num__X1 | num__X2 | cat__X4_A | cat__X4_B |
|---|---|---|---|---|
| 221 | 0.221569 | 1.145371 | 1.0 | 0.0 |
| 455 | 0.624914 | 0.365996 | 1.0 | 0.0 |
| 184 | -0.067670 | -1.902537 | 0.0 | 1.0 |
| 392 | -0.752972 | 0.412464 | 0.0 | 1.0 |
| 762 | 1.273299 | -0.282722 | 1.0 | 0.0 |
The names of the transformed variables match those in the training set as expected.
Can we recover the training set from the preprocessed data? Indeed we can and it will give us a chance to get acquainted with the structure of pipelines. This is a more complex section than most of the preceding ones. So do not worry if you do not really get it the first time! You can return here once you have gained experience working with pipelines.
We need two ingredients. First we need to get to the (already fitted) scaler transformer. We can do it like this:
scaler = preproc_pipe["prep"].transformers_[0][1]

We recommend examining the right-hand side incrementally. That is, first execute preproc_pipe["prep"] in a cell and look at its output, then preproc_pipe["prep"].transformers_, then preproc_pipe["prep"].transformers_[0], and so on until you get the full picture. The second ingredient is simpler: we need the names of the variables we are inverse-transforming.
num_inputs_preproc = preproc_XTR.columns[[0, 1]]
num_inputs_preproc

Index(['num__X1', 'num__X2'], dtype='object')
Now we can apply the inverse_transform method of the scaler to the transformed variables. Watch out: the format of the result is a numpy array!
scaler_inv = scaler.inverse_transform
scaler_inv(preproc_XTR[num_inputs_preproc])

array([[-1.03940879,  3.63512679],
[ 1.82738029, 2.19098463],
[ 1.22600163, 3.58692705],
...,
[ 0.14699154, 5.87695104],
[ 0.33667398, 1.20710585],
[ 1.87704173, 3.67005701]])
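If a dataframe is preferred, the recovered values can be wrapped back using the original variable names and the index of the transformed data (a small sketch of our own):

# Put the inverse-transformed values into a dataframe with the original column names
recovered = pd.DataFrame(scaler_inv(preproc_XTR[num_inputs_preproc]),
                         columns=num_inputs, index=preproc_XTR.index)
recovered.head()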
Now we have a preprocessed data set, and we are ready to move to the last three steps of our plan to solve a classification problem:
In the next sessions we will study some important families of models: logistic regression and KNN models.