Tag: DataOps

  • Implementing a decision tree using Python and R

    Implementing a decision tree using Python and R

In one of our earlier articles, we discussed the basic concepts behind decision trees, including their definition and the core terminology we use with the algorithm.

Having covered the theory, we now need to understand how to use this model in practice. This article extends the earlier one by walking through the implementation of a decision tree in the Python and R programming languages. It will cover the following topics:

    Table of Contents

    Implementation of Decision Tree using the Python Programming Language

    1. Data splitting
    2. Importing and Fitting the Decision Tree Model
    3. Model Evaluation

    Implementation of Decision Tree using the R Programming Language

    1. Data splitting
    2. Importing and Fitting the Decision Tree Model
    3. Model Evaluation

Implementation of a Decision Tree using the Python Programming Language

For this, we will use the sklearn Python library, which will help us both fit the model and import the iris data.

The iris data gives us four continuous variables: the sepal length, sepal width, petal length, and petal width of iris flowers. Based on these features, the flowers are separated into three categories: Iris Setosa, Iris Versicolour, and Iris Virginica. Let's import the dataset.

from sklearn import datasets

data = datasets.load_iris()
X = data.data
y = data.target

print('independent variable names\n', data.feature_names)
print('shape of independent variables\n', X.shape)
print('class names in target variable\n', data.target_names)
print('shape of target variable\n', y.shape)

    Output:

    In the data, we get 150 data points and four variables as discussed above.

    Now to model this data using a decision tree, we will use the following steps:

    • Data splitting
    • Importing and fitting the decision tree model
    • Model evaluation

    Let’s start with data splitting.

    Data splitting

This step creates two sets of data (train and test). Using the train set we will train a decision tree model, and using the test set we will evaluate the trained model. Let's split the data.

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

Let's check the shape of the split sets.

print("shape of train data", X_train.shape, y_train.shape)
print("shape of test data", X_test.shape, y_test.shape)

    Output:

    Importing and Fitting the Decision Tree Model

This step shows how to fit the decision tree model to the data. Note that the sklearn model expects the training data as NumPy arrays; since the data loaded from sklearn already comes as NumPy arrays, no transformation is needed and we can fit the split data directly. Let's import and train the model.

    from sklearn import tree

    clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    The above code has called and trained the model using the train data. We can plot this tree to see how the splits worked with the iris data.

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))
tree.plot_tree(clf, feature_names=data.feature_names)
plt.show()

    Output:

Here we can see that the root node of the decision tree splits on petal width: if the petal width is less than or equal to 0.8 cm, the sample goes to one class (there are 37 such samples in the whole train data); if the petal width is larger than 0.8 cm, the flower belongs to one of the other classes and is split further.
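If the plotted tree is hard to read, the same splits can also be inspected as plain text; a minimal sketch, assuming the clf and data objects defined above:

# print the fitted splits as an indented text report
print(tree.export_text(clf, feature_names=data.feature_names))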

    Let’s make predictions using the test data.

    prediction = clf.predict(X_test)

    Here in the prediction variable, we have values predicted by the model for the test data. Now, we can evaluate our model using the prediction set against the true values.

    Model Evaluation

This section will use the accuracy score, F1 score and confusion matrix to evaluate the model. First, their definitions are given below.

Accuracy score: This measures how many of the model's predictions match the true values, as a fraction of all predictions.

F1 score: This is the harmonic mean of precision and recall, where precision is the proportion of predicted positives that actually belong to the positive class, and recall is the proportion of actual positives that the model correctly predicts. Mathematically,

    F1 = 2 * (precision * recall) / (precision + recall)
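Since the F1 score is built from precision and recall, it can also help to look at those two scores on their own; a minimal sketch, assuming the y_test and prediction arrays from above:

from sklearn.metrics import precision_score, recall_score

# per-class precision and recall for the three iris classes
print('precision per class:', precision_score(y_test, prediction, average=None))
print('recall per class:', recall_score(y_test, prediction, average=None))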

Confusion matrix: This matrix shows how the model's predictions are distributed across the true classes, with each cell counting the predictions for one true-class/predicted-class pair.

    Let’s calculate the above-defined scores and matrix.

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print('confusion matrix\n', confusion_matrix(y_test, prediction))
print('accuracy score of our model\n', accuracy_score(y_test, prediction))
print('f1 score of our model\n', f1_score(y_test, prediction, average='micro'))

    Output:

Here we can see that the model has predicted only one value wrong and has achieved about 97% accuracy, with a similar F1 score. This completes the Python implementation; in the next section, we will perform the same steps using the R programming language.

Implementation of a Decision Tree using the R Programming Language

To work with the same data in the R programming language, we can use the datasets library. The code below loads the iris data.

library(datasets)

    data(iris)

    head(iris)

    Output:

Here we can see exactly what our data looks like. Now we will follow the same steps as in the Python implementation.

    Data splitting

    To complete this step, we will use the caTools library.

    library(caTools)

sample_data = sample.split(iris$Species, SplitRatio = 0.8)

    train_data <- subset(iris, sample_data == TRUE)

    test_data <- subset(iris, sample_data == FALSE)

Here we have split the data in an 80/20 ratio, where 80% of the data is used for training and 20% for testing the model.

    Importing and Fitting the Decision Tree Model

To complete this step, we will use the rpart library, which lets us fit a decision tree to the data. The code below calls and trains the model.

library(rpart)

clf <- rpart(formula = Species ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(cp = 0),
             parms = list(split = "information"))

    Let’s check the model by plotting it.

library(rpart.plot)

prp(clf, extra = 1, faclen = 0, nn = TRUE,
    box.col = c("green", "red"))


    Output:

We can also use the caret library to check how important each feature/variable of our data was in the modelling.

    library(caret)

    importances <- varImp(clf)

    importances

    Output:

    Here we can see that the petal width is the most important variable in the training of the decision tree model.
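If you are following along in Python as well, the scikit-learn tree fitted earlier exposes the same information through its feature_importances_ attribute; a minimal sketch, assuming the Python clf and data objects from the first section are still in scope:

# per-feature importance from the sklearn decision tree
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(name, round(importance, 3))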

    Let’s make predictions from the model.

prediction <- predict(clf, newdata = test_data, type = "class")

    prediction

    Output:

    This is how our model has predicted on the test data.

    Model Evaluation

Using only one line of code, we can evaluate our model against various metrics.

    confusionMatrix(test_data$Species, prediction)

    Output:

Here we get most of the statistics that can be used to evaluate the model, and we can see that the model has predicted only one value wrong, with an accuracy of around 97%.

    Final words

The decision tree is an excellent introductory model to the tree-based model family, and it is also a common baseline for models such as random forest and gradient boosting.

This article has looked at how we can implement a decision tree model using the Python and R programming languages, along with how to plot and evaluate the fitted models. We will shortly cover more such machine learning and data science models and concepts; to stay updated, you can keep yourself connected through this link.


    About DSW

    Data Science Wizards (DSW) is an Artificial Intelligence and Data Science start-up that primarily offers platforms, solutions, and services for making use of data as a strategy through AI and data analytics solutions and consulting services to help enterprises in data-driven decisions.

    DSW’s flagship platform UnifyAI is an end-to-end AI-enabled platform for enterprise customers to build, deploy, manage, and publish their AI models. UnifyAI helps you to build your business use case by leveraging AI capabilities and improving analytics outcomes.

Connect with us at contact@datasciencewizards.ai and visit us at darkseagreen-chicken-141904.hostingersite.com/

  • Getting Started with Machine Learning Models: Polynomial Regression

    Getting Started with Machine Learning Models: Polynomial Regression

In a series of articles, we have already discussed how linear and logistic regression work. In this article, we will discuss the polynomial regression model. This model is quite similar to linear regression because we use it for regression modelling in the same way; what makes it different is the flexibility of the regression line, or in other words, it uses a curve to model the data points.

There are various cases in real life where linear regression is not useful because the variables have a non-linear rather than linear relationship. In such cases, the polynomial regression algorithm can be a valuable tool for modelling the data.

    In this article, we are going to discuss the following points on polynomial regression:

    • What is polynomial regression?
    • Why do we use polynomial regression?
    • Linear regression vs polynomial regression

What is Polynomial Regression?

We can consider polynomial regression a special case of the linear regression model: it models the relationship between the variables as an nth degree polynomial, which allows it to fit non-linear data points using polynomial terms.

To understand how it works, take a dataset with two variables, where X is the independent and Y the dependent variable. When we feed the data to the model, we transform the input variable into polynomial terms of degree n, i.e. X⁰, X¹, X², ..., Xⁿ, and these polynomial terms let the model capture the non-linear relationship between the variables. The mathematics behind polynomial regression is similar to linear regression, and the polynomial regression equation can be written as

Y = b0 + b1·X + b2·X² + ... + bn·Xⁿ + ε
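To make the transformation concrete, here is a minimal sketch (the input values are hypothetical, and scikit-learn is assumed) showing how a single input column is expanded into polynomial terms:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])
# each row is expanded to [1, x, x^2, x^3]
print(PolynomialFeatures(degree=3, include_bias=True).fit_transform(x))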

The degree of the polynomial is a hyperparameter of this model, and we need to choose it wisely when modelling. One major problem is that a model with too high a polynomial degree tends to overfit, while one with too low a degree tends to underfit, so finding the optimal degree of the polynomial is the main challenge in this kind of modelling.

The Right Degree of Polynomial can be Found in Two Ways:

• Forward selection: using this method, we start from a low degree and keep increasing it until we find the best-fit or optimal model (see the sketch after this list).
• Backward selection: using this method, we start from a high degree and keep decreasing it until we find the best-fit or optimal model.
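A minimal sketch of forward selection over the degree, using cross-validation on hypothetical sine-shaped data (the data, degree range and names here are illustrative, not from the article):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

best_degree, best_score = None, -np.inf
for degree in range(1, 10):                        # keep increasing the degree
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    if score > best_score:                         # keep the degree with the best CV score
        best_degree, best_score = degree, score

print('best degree:', best_degree, 'cv R-squared:', round(best_score, 3))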

Before applying the polynomial regression model, we need to consider some of its assumptions, as stated below:

    Assumptions of Polynomial Regression

• The behaviour of the dependent variable should be explainable by a linear, curvilinear or additive relationship between the dependent and independent variables.
    • Independent variables need to be independent of each other.
    • Errors should be independent and normally distributed with constant variance and mean zero.

    Note: when the degree of the polynomial in polynomial regression is one, the model is comparable to the linear regression model.

    Why do we use Polynomial Regression?

Looking back at the last article, linear regression can be applied to datasets where the relationship between the variables is linear, as shown in the image below.

In real-life datasets, we often find that the variables are not linearly related. For example, data collected on the salaries of employees across different departments shows uneven variance, and such a dataset can look like the following:

    Let’s say using such data points, we have drawn a regression model as given below:

Using this model, we predicted the salary of an employee with 6.5 years of experience, and the model places it somewhere between 40,000 and 60,000. But looking at the data points, we can easily say that it is around 20,000. To close this gap in prediction, we use polynomial regression models.

So, whenever we perform regression on data with a non-linear relationship between the variables, we can use polynomial regression. Now let's check how we can implement polynomial regression using the Python programming language while comparing it with linear regression.

    Linear Regression vs Polynomial Regression

    In the above sections, we discussed polynomial regression and where we can use it. In this section, we will compare the effect of applying the linear regression model and the polynomial regression model on non-linear data. To do so, we will generate random data using the NumPy library. Let’s start with making data:

    import numpy as np

    X = np.arange(0, 13, 0.1).reshape(-1, 1)

    Y = np.sin(X).ravel()

    print(X)

    Output:

    Here in the above, we have made data in the sine form so that both variables can follow the below relationship.

    Y = sin(X)

These variables don't have a linear relationship. Let's plot the data to verify this.

import matplotlib.pyplot as plt

plt.figure(1)
plt.axhline(y=0, color='k')
plt.grid(True)
plt.scatter(X, Y, color='blue')
plt.show()


    Output:

    Here we can see that the data we have generated is following the sine-waveform. Now, let’s split the data into test and train data.

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 42)

    Now let’s fit the linear regression model and check the performance.

from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression

model1 = LinearRegression().fit(X_train, y_train)
y_pred = model1.predict(X_test)

print('R-square score =>', r2_score(y_test, y_pred))
print('RMSE for Linear Regression =>', np.sqrt(mean_squared_error(y_test, y_pred)))

    Output:

Here we can see that we got a poor R-square value and a high RMSE with this model. Let's plot the model's predictions.

plt.plot(X_test, y_pred, color='g')
plt.plot(X_test, y_test, "b.")

    Output:

    In the above, we can clearly see how badly linear regression is performing on our data.

Now, let's model the data using polynomial regression.

In Python, there are various ways to implement polynomial regression. In this article, we will use sklearn's PolynomialFeatures class to transform the data into polynomial form and then model the transformed data using the linear regression model.

    Let’s start the procedure by transforming the data.

    from sklearn.preprocessing import PolynomialFeatures

    poly = PolynomialFeatures(degree = 5, include_bias = True)

    X_train_transf = poly.fit_transform(X_train)

    X_test_transf = poly.transform(X_test)

    Let’s fit the linear regression model on transformed data and check the performance.

model2 = LinearRegression().fit(X_train_transf, y_train)
y_pred = model2.predict(X_test_transf)

print('R-square score =>', r2_score(y_test, y_pred))
print('RMSE for Polynomial Regression =>', np.sqrt(mean_squared_error(y_test, y_pred)))

    Output:

Here we can see that the R-squared value has improved and the RMSE has dropped for this type of regression. Let's plot the results.

plt.plot(X_test, y_pred, "r.", linewidth=2, label="Predictions")
plt.plot(X_test, y_test, "b.")
plt.xlabel('Predictions', fontsize=12)
plt.ylabel('Target', fontsize=12)
plt.legend(['predictions', 'original'])

    Output:

Here, the predictions are much closer to the original values. To verify further, we can generate another dataset and predict on that.

Let's create another set of sine-wave data and use the above-trained polynomial regression model to predict new values.

    X_test = np.arange(0, 8, 0.1).reshape(-1, 1)

    Y_test = np.sin(X_test).ravel()

    poly = PolynomialFeatures(degree = 5, include_bias = True)

    X_test_trans = poly.fit_transform(X_test)

    y_pred_new = model2.predict(X_test_trans)

    Let’s plot the predicted and actual values again.

plt.plot(X_test, y_pred_new, "r.", linewidth=2, label="Predictions")
plt.plot(X_test, Y_test, "b.")
plt.xlabel('Predictions', fontsize=12)
plt.ylabel('Target', fontsize=12)
plt.legend(['predictions', 'original'])

    Output:

Looking at the above results, we can say that this model is a clear improvement over the plain linear regression model.

    Final Words

In this article, we introduced the polynomial regression model. We can think of it as an improvement over simple linear regression for cases where the dependent and independent variables are not linearly but curvilinearly related. To check its capabilities, we generated sine-wave data and modelled it with both linear and polynomial regression models, and we found that at polynomial degree 5 we got a better-fitting model than linear regression.


  • End-to-End Logistic Regression Modelling

    End-to-End Logistic Regression Modelling

In machine learning, logistic regression is one of the basic models with which beginners start learning classification modelling, and it is particularly useful for modelling binary classification data. In one of our articles, we have already discussed how this algorithm works and how to apply it to synthetic data. In this article, we are going to use the algorithm with a real-life dataset, covering the following topics:

    • Data Exploration
    • Assumptions
    • Modelling
    1. Data processing
    2. Forward feature selection
    • Evaluation
    1. Confusion matrix
    2. Evaluation statistics
    3. ROC Curve
    4. Area Under the Curve

Let's start with gathering data. For this article, we will use a heart disease dataset that relates different risk factors to whether a person is diagnosed with heart disease. Let's import this dataset as a pandas DataFrame.

    import warnings

    warnings.filterwarnings(‘ignore’)

    import pandas as pd

    data = pd.read_csv(‘https://raw.githubusercontent.com/TarekDib03/Analytics/master/Week3%20-%20Logistic%20Regression/Data/framingham.csv’)

    data.head()

    Output:

    Here we can see different factors that cause heart diseases. Some of them are categorical, and some of them are continuous data. Below, we can see the meanings of these variables:

• Demographic variables:
1. male: sex of the person (boolean/categorical)
2. age: age of the person (numerical, continuous)
• Behavioural variables:
1. currentSmoker: whether the person currently smokes (boolean)
2. cigsPerDay: average number of cigarettes the person smokes per day (continuous)
• Medical (history):
1. BPMeds: whether the person takes blood pressure medication (boolean)
2. prevalentStroke: whether the person previously had a stroke (boolean)
3. prevalentHyp: whether the person has a history of hypertension (boolean)
4. diabetes: diabetes status (boolean)
5. totChol: total cholesterol level (continuous)
6. sysBP: systolic blood pressure (continuous)
7. diaBP: diastolic blood pressure (continuous)
8. BMI: body mass index (continuous)
9. heartRate: heart rate (continuous)
10. glucose: glucose level (continuous)
• Target variable
1. TenYearCHD: ten-year risk of coronary heart disease (boolean)

In the column male, we have the values 0 and 1. We will treat 1 as male and 0 as female and relabel the data accordingly using the following lines of code:

data['male'] = data['male'].map({1: 'male', 0: 'female'})
data.rename(columns={'male': 'Sex'}, inplace=True)

Let's perform some basic EDA on the data before modelling it.

    Data Exploration

Let's begin by checking the shape of the data.

print('shape of the data', data.shape)

    Output:

    In the data, we have 4240 data points and 16 variables, out of which 15 are independent, and 1 is dependent.

    Describing data

    data.describe()

    Output:

Here we can see the data distribution, but it will be easier to read through visualisations because the data size is large. Before the visualisation part, we should check for null values.

import missingno as ms

print(data.isna().sum())
ms.bar(data)

    Output:

    We can see that there are various null values present in multiple variables. Let’s see the total number of null values in the data.

print('null value percentage', round(data.isna().sum().sum() / len(data.index) * 100, 2))

    Output:

We can see that only about 15% of the data is affected by null values, so instead of trying to impute them, we can simply drop those rows from the dataset.

    data.dropna(axis=0,inplace=True)

    data.shape

    Output:

    Here we have completed the basic EDA of our dataset. Next, let’s understand data more using some visualisations.

    Gender distribution

import plotly.express as px

fig = px.histogram(data, x="Sex", color="Sex", barmode='group',
                   title='gender distribution', width=600, height=400)
fig.show()

    Age distribution

import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(nrows=2, figsize=(8, 10))
sns.boxplot(data['age'], ax=axs[0]).set(title='age distribution-boxplot')
sns.distplot(data['age'], ax=axs[1]).set(title='age distribution-histogram')
plt.show()

    Output:

    In the above visualisation, we can see that most records are between the ages of 30 and 70, and the average age in the data is nearly 45 to 50.

    Let’s see how the age variable is distributed according to gender.

fig = px.histogram(data, x="age", color="Sex", title='Age distribution according to sex',
                   nbins=60, width=600, height=400)

    fig.show()

This visualisation shows how age is distributed for each gender; looking at it, we can say there are more women than men in the 45 to 50 range. Now let's see how many males and females smoke.

fig = px.histogram(data, x="Sex", color="currentSmoker", barmode='group',
                   title='Smoker according to gender', width=600, height=400)

    fig.show()

    Output:

Here we can see that more men than women smoke. Now, instead of plotting every feature separately, we will combine them in one plot and try to understand how the features relate to each other.

fig, axis = plt.subplots(3, 5, figsize=(20, 20))
data.hist(ax=axis)

    Output:

In the above, if we compare the plots with the plot of TenYearCHD, we can see that the BPMeds distribution looks similar to TenYearCHD, which suggests that someone taking blood pressure medication is more likely to be at risk of coronary heart disease.

The diabetes plot is also similar to the CHD plot, and prevalentStroke and prevalentHyp look quite alike, so they may have a significant effect on TenYearCHD.

The relationships among the continuous features are easier to see with a correlation plot.

    corrmat = data.corr()

    fig = px.imshow(corrmat,text_auto=True,width = 700, height= 1000)

    fig.show()

    Output:

Here we can see that several features are highly and positively correlated with each other; the blood pressure measures sysBP and diaBP, for example, are clearly related.

Our data exploration is now mostly complete, so let's move on to the assumptions we need to consider in classification modelling.

    Assumptions

In our last article, we discussed five assumptions to consider when using logistic regression for data modelling.

1. Dependent variable with binary data points: In the above, we have already seen that TenYearCHD is binary data.
2. Independent data points: Since the collected data points belong to different persons, there is no dependency between them.
3. Linearity between the independent variables and their log odds: This assumption can be validated using a regression plot of each continuous independent variable, which we can do using the seaborn library:

fig, axs = plt.subplots(ncols=2, nrows=4, figsize=(8, 10))

sns.regplot(x='age', y='TenYearCHD', data=data, ax=axs[0, 0], logistic=True).set_title("age Log Odds Linear Plot")
sns.regplot(x='cigsPerDay', y='TenYearCHD', data=data, ax=axs[0, 1], logistic=True).set_title("cigsPerDay Log Odds Linear Plot")
sns.regplot(x='totChol', y='TenYearCHD', data=data, ax=axs[1, 0], logistic=True).set_title("totChol Log Odds Linear Plot")
sns.regplot(x='sysBP', y='TenYearCHD', data=data, ax=axs[1, 1], logistic=True).set_title("sysBP Log Odds Linear Plot")
sns.regplot(x='diaBP', y='TenYearCHD', data=data, ax=axs[2, 0], logistic=True).set_title("diaBP Log Odds Linear Plot")
sns.regplot(x='BMI', y='TenYearCHD', data=data, ax=axs[2, 1], logistic=True).set_title("BMI Log Odds Linear Plot")
sns.regplot(x='heartRate', y='TenYearCHD', data=data, ax=axs[3, 0], logistic=True).set_title("heartRate Log Odds Linear Plot")
sns.regplot(x='glucose', y='TenYearCHD', data=data, ax=axs[3, 1], logistic=True).set_title("glucose Log Odds Linear Plot")

# set the spacing between subplots
plt.subplots_adjust(left=0.1, bottom=0.0, right=0.9, top=0.9, wspace=0.4, hspace=0.4)

fig.show()

    Output:

Here we need to look at the shape of the plots. If a plot resembles an S-shaped curve, we can assume the continuous independent variable is linearly related to its log odds; if any plot appears U-shaped, we need to consider extra data handling. In the above, every plot resembles an S-shaped or linear curve, so no further data operation is needed here.

4. No multicollinearity: In the above, we have already seen strong multicollinearity between the continuous independent variables; we will deal with this later in the article (a quick way to quantify it is shown in the sketch after this list).
5. Enough data: At the beginning of the data exploration, we saw that more than 4000 data points are available, which is quite enough to use logistic regression.
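A minimal sketch of such a multicollinearity check using variance inflation factors; this assumes the statsmodels package (not used elsewhere in this article) and the cleaned data DataFrame from above:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# continuous predictors from the dataset, plus a constant term for the regressions
continuous = add_constant(data[['age', 'cigsPerDay', 'totChol', 'sysBP',
                                'diaBP', 'BMI', 'heartRate', 'glucose']])
vif = pd.Series(
    [variance_inflation_factor(continuous.values, i) for i in range(continuous.shape[1])],
    index=continuous.columns
)
print(vif.drop('const'))  # values well above roughly 5-10 suggest strong multicollinearity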

Let's move on to the next section.

    Modelling

In the sections above, we explored the data and looked at the assumptions we need to consider while modelling it with logistic regression. In this section, we will use a technique called forward feature selection, together with a logistic regression model, to help us select the right features. Before performing forward feature selection, we need to do some data processing, so let's start there.

    Data preprocessing

    Converting the categorical values into numerical values

data['Sex'] = data['Sex'].map({'male': 1, 'female': 0})

    Splitting the data into test and train sets.

    from sklearn.model_selection import train_test_split

    X = data.iloc[:,:-1]

    y = data.iloc[:, -1]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

Here we have used 75% of the data for training and the rest for testing. Now let's move on to the forward feature selection process; before the implementation, let's understand the technique.

    Forward Feature Selection

As the name suggests, it is a feature selection technique that works iteratively. It starts with no features in the model and, at each iteration, adds the feature that improves the model the most; after testing the features in this way, it returns the best-performing feature subset. The image below illustrates forward feature selection.


Let's implement forward feature selection.

# SequentialFeatureSelector here comes from the mlxtend library
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(LogisticRegression(n_jobs=-1),
                                k_features=(1, 14),
                                forward=True,
                                floating=False,
                                verbose=2,
                                scoring='accuracy',
                                cv=5).fit(X_train, y_train)

    Output:

In the output above, we can see the history of the feature selection process. We can now extract more information from this object, such as the selected feature subset and the table of tests. Let's check them.

    Index of selected features

    sfs.k_feature_idx_

    Output:

    Names of the selected features

    sfs.k_feature_names_

    Output:

The score of the optimised model

    sfs.k_score_

    Output:

We chose this method because our data has multicollinearity, and we need to extract the features that predict best. We can also look at the whole history of the forward feature selection procedure.

pd.DataFrame.from_dict(sfs.get_metric_dict()).T

    Output:

    Here we can see the accuracy of different models in the avg_score column. Let’s fit the logistic regression model with optimised features.

    Extracting the features from the main data

new_features = data[['BPMeds', 'totChol', 'diaBP', 'heartRate', 'TenYearCHD']]

X = new_features.iloc[:, :-1]
y = new_features.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    Fitting the model and generating predictions based on the test set.

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

After making predictions, we need to evaluate the fitted model. The logistic regression model can be evaluated using four methods; let's take a look at them.

    Evaluation

    Here we are going to evaluate our model using 4 methods:

    • Confusion matrix
    • Evaluation statistics
    • ROC curve (receiver operating characteristic curve)
    • AUC(area under the curve)

Let's start with the first method.

    Confusion matrix

This matrix tells us how many right and wrong predictions the model makes.

import sklearn.metrics

sklearn.metrics.plot_confusion_matrix(logreg, X_test, y_test)

    Output:

    Here we can see that (764 + 1) 765 values out of 915 are predicted right, and 150(148+2) are wrong. Based on the confusion matrix, we can evaluate our other evaluation statistics.

    Evaluation statistics

cm = sklearn.metrics.confusion_matrix(y_test, y_pred)

TN = cm[0, 0]
TP = cm[1, 1]
FN = cm[1, 0]
FP = cm[0, 1]

sensitivity = TP / float(TP + FN)
specificity = TN / float(TN + FP)

print('The accuracy of the model', (TP + TN) * 100 / float(TP + TN + FP + FN), '%\n',
      'The misclassification percentage', 100 - (TP + TN) * 100 / float(TP + TN + FP + FN), '%\n',
      'Sensitivity or True Positive Rate', TP * 100 / float(TP + FN), '%\n',
      'Specificity or True Negative Rate', TN * 100 / float(TN + FP), '%\n',
      'Positive Predictive Value', TP * 100 / float(TP + FP), '%\n',
      'Negative Predictive Value', TN * 100 / float(TN + FN), '%\n',
      'Positive Likelihood Ratio', sensitivity / (1 - specificity), '\n',
      'Negative Likelihood Ratio', (1 - sensitivity) / specificity)

    Output:

Above are the evaluation statistics, which tell us about the model's accuracy, misclassification rate, positive and negative predictive values and other ratios. One thing to note is that the model is more specific than sensitive; it is better at identifying negative cases. So let's check the probabilities the model uses to make its predictions.

pred_prob = logreg.predict_proba(X_test)[:, :]
pred_prob_df = pd.DataFrame(data=pred_prob,
                            columns=['Prob of predicting negative (0)', 'Prob of predicting positive (1)'])
pred_prob_df.head()

    Output:

In the above, the threshold used by the model is 0.5 (the default). Using the model as-is is not advisable because it is biased toward one class, so we can iterate over other threshold values to look for better predictions.

pred_proba_df = pd.DataFrame(logreg.predict_proba(X_test))

threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]

for i in threshold_list:
    print('\n******** For threshold = {} ******'.format(i))
    y_test_pred = pred_proba_df.applymap(lambda x: 1 if x > i else 0)

    test_accuracy = sklearn.metrics.accuracy_score(
        y_test.to_numpy().reshape(y_test.to_numpy().size, 1),
        y_test_pred.iloc[:, 1].to_numpy().reshape(y_test_pred.iloc[:, 1].to_numpy().size, 1))
    print('Our testing accuracy is {}'.format(test_accuracy))

    cm2 = sklearn.metrics.confusion_matrix(
        y_test.to_numpy().reshape(y_test.to_numpy().size, 1),
        y_test_pred.iloc[:, 1].to_numpy().reshape(y_test_pred.iloc[:, 1].to_numpy().size, 1))
    print('With', i, 'threshold the confusion matrix is\n', cm2, '\n',
          'with', cm2[0, 0] + cm2[1, 1], 'correct predictions and', cm2[1, 0],
          'Type II errors (False Negatives)\n\n',
          'Sensitivity:', cm2[1, 1] / float(cm2[1, 1] + cm2[1, 0]),
          'Specificity:', cm2[0, 0] / float(cm2[0, 0] + cm2[0, 1]), '\n\n\n')

    Output:

    Part 1:

    Part 2:

Here we can see that as the threshold gets lower, the model's sensitivity increases, and beyond a certain threshold it levels off or decreases. The ROC curve can give a more complete picture.

    ROC curve

This curve shows the performance of a classification model across all classification thresholds; fundamentally, it shows the trade-off between the sensitivity and specificity of the model.

y_pred2 = logreg.predict_proba(X_test)
fpr, tpr, thresholds = sklearn.metrics.roc_curve(y_test, y_pred2[:, 1])

plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for classifier')
plt.xlabel('False positive rate (1 - Specificity)')
plt.ylabel('True positive rate (Sensitivity)')
plt.grid(True)

    Output:

Looking at the above, we can say that the optimum position for an ROC curve is toward the top-left corner, because a good classifier produces far more true positives than false positives. Let's check the area under the curve.

    Area Under the Curve(AUC)

The area under the curve can be seen as a summary of the model's classification ability: a larger area means a better separation between the true positive and false positive rates and a higher capability of the model to distinguish the classes. An AUC of 0.5 corresponds to random guessing, and values closer to 1 represent better classification.

print('Area Under The Curve (AUC): ', sklearn.metrics.roc_auc_score(y_test, y_pred2[:, 1]))

    Output:

    Looking at the AUC, we can say our model is a moderate classifier.

    Final words

In the above, we have seen how to perform classification modelling using logistic regression. During the process, we looked at the data insights and then used forward feature selection to choose the right features and get higher accuracy. We also went through some of the assumptions we need to take care of during modelling and some of the evaluation techniques we can use to evaluate the model correctly. In the end, we can draw the following conclusions:

• Forward feature selection helped find significant features that can impact heart disease prediction.
    • Data analysis told us men are more addicted to cigarettes and susceptible to heart diseases.
    • Total cholesterol is an important predictor of heart disease.
    • The model we made is more specific than sensitive.
• AUC is around 0.62, which indicates a moderate classifier.
• The model would benefit from more data from the positive class, so that the class imbalance problem can be addressed.


  • Getting Started with Machine Learning Algorithms: Logistic Regression

    Getting Started with Machine Learning Algorithms: Logistic Regression

In the field of data science, we find a variety of algorithms and models for regression and classification modelling. Logistic regression can be considered the starting point for learning classification and predictive modelling. Although it comes from the regression model family, it uses a curve to separate data into classes. We at DSW often prefer such small algorithms for modelling small use-cases and problems because they are robust and easy to interpret. In this article, we are going to talk about logistic regression, starting with what the algorithm is.

    What is logistic regression?

Logistic regression is one of the most basic and traditional supervised machine learning models used for classification and predictive modelling. At its core, the algorithm models the probability of an event or a class. Since it comes from the regression family, it uses a line or curve to model the data; we use it when the dataset can be separated by such a boundary and the required outcome is binary or dichotomous.

That means we use logistic regression for binary classification, which applies when the target variable is separated into two classes. Simple examples are yes/no, 0/1 and win/loss.

    There are two types of logistic regression

    • Simple logistic regression
    • Multiple logistic regression

Simple logistic regression is used when only one independent variable affects the dependent variable, and multiple logistic regression is used when more than one independent variable affects the dependent or target variable.

The model can also be extended to multinomial logistic regression and ordinal logistic regression, where there are more than two discrete classes or more than two classes of an ordinal nature. Since a single boundary cannot separate more than two classes accurately, in this article we focus on the simplest version of logistic regression, used for binary classification. Let's see how this algorithm works.
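As an aside, scikit-learn's LogisticRegression can already handle more than two classes through a multinomial (or one-vs-rest) formulation; a minimal sketch, separate from the binary example that follows later:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                  # three classes, not two
clf = LogisticRegression(max_iter=1000).fit(X, y)  # multiclass handled internally
print(clf.classes_)
print(clf.predict(X[:5]))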

    How does Logistic Regression Work?

As discussed above, logistic regression works by separating linearly separable data, just like linear regression. To understand how it works, we need to look at the mathematics behind it.

    Mathematics

Let's consider one predictor or independent variable X and one dependent variable y, and let p be the probability of y being 1. In such a situation, a linear-regression-style equation can be written as:

p = mx + m0 .....(1)

The right-hand side of this equation is linear and can take values outside the range 0 to 1, but a probability must lie between 0 and 1. To handle this, we can model the odds instead of the probability, using the following formula:

    Odds = p/(1-p)

    Where,

    p = probability of occurrence of any event.

    1-p = probability of non-occurrence of any event.

In terms of odds, equation (1) can be written as:

p/(1-p) = mx + m0 .....(2)

The odds are always positive while the right-hand side can take any real value, so we take the log of the left side of equation (2):

log(p/(1-p)) = mx + m0 .....(3)

To recover p, we take the exponential of both sides:

p/(1-p) = e^(mx + m0) .....(4)

Simplifying equation (4), we get:

p = (1-p) e^(mx + m0) .....(5)

We can rearrange this equation as follows:

p (1 + e^(mx + m0)) = e^(mx + m0)

p = e^(mx + m0) / (1 + e^(mx + m0)) .....(6)

Now we divide the numerator and denominator of equation (6) by e^(mx + m0):

p = 1 / (1 + e^-(mx + m0))

This is the probability that logistic regression uses in the single-predictor case. If there are n predictors, the equation becomes:

p = 1 / (1 + e^-(m0 + m1x1 + m2x2 + m3x3 + ... + mnxn))

This is the final equation of logistic regression with n predictors. It is often compared with the sigmoid function because it also keeps the output in the range 0 to 1.

    In the above, we can see how we started with the linear equation and ended with the curve.

Mathematically, the sigmoid function can be written as:

σ(z) = 1 / (1 + e^-z)

Replacing z with mx + m0 gives back the logistic regression equation. The sketch below shows the resulting S-shaped curve that logistic regression uses.
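A minimal sketch standing in for the original figure; the slope m and intercept m0 values here are hypothetical, chosen only to draw the curve:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m, m0 = 1.5, -0.5           # hypothetical slope and intercept
x = np.linspace(-6, 6, 200)
p = sigmoid(m * x + m0)     # p = 1 / (1 + e^-(mx + m0))

plt.plot(x, p)
plt.axhline(0, color='k', linewidth=0.5)
plt.axhline(1, color='k', linewidth=0.5)
plt.xlabel('x')
plt.ylabel('P(y = 1 | x)')
plt.show()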

The curve stays between 0 and 1 for any input, which is exactly what we need for a probability. Now, before applying logistic regression to any data, we need to consider some assumptions. Let's take a look at them.

    Assumptions

    Before modelling data using a basic logistic regression algorithm we are required to consider the following assumptions:

    • If any extensions are not applied then data needs to have a dependent variable with binary data points.
    • The data points under the data need to be independent of each other.
    • The independent variables of the data need to have no or small multicollinearity with each other.
    • The independent variables and their odds need to be linear to each other.
• Logistic regression generally works best with a reasonably large dataset, so it is good to have enough data when using it.

    Here we have seen some of the assumptions that need to be covered before applying logistic regression. Let’s see how we can apply logistic regression to any data.

    Implementation

In this section, we will look at how to apply logistic regression to data using the Python programming language. Logistic regression can also be performed in R, MATLAB and Excel, but considering the size of the article we will use only Python.

In Python, sklearn is a library that provides functions for applying almost every kind of machine learning algorithm to our datasets; for logistic regression, it provides the LogisticRegression class in the linear_model package, which we use here.

    Let’s start with making a synthetic dataset using the make_classification function of Sklearn.

    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

Here we have created a dataset with 1000 rows and 5 columns of independent variables and one dependent variable with two classes, 0 and 1. We can verify this by converting the arrays to a pandas DataFrame.

    import pandas as pd

    df = pd.DataFrame(data=X)

df['Target'] = y

    df

    Output:

    Here we can see all the independent and dependent variables in one place.

Before modelling, we should know which variables in our dataset correlate best with the target variable. Let's check the correlation.

import seaborn as sns

corr = df.corr()
sns.heatmap(corr, annot=True, fmt='.1%')

    Output:

    Here we can see that variables 0, 1, and 3 have higher correlations with the target variable and we can consider them in the data modelling with logistic regression.

    Let’s trim and split the datasets

    X = X[:, [0,1,3]]

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.29, random_state = 0)

    Let’s import the function from Sklearn and model the data.

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()

    model.fit(X_train, y_train)

    Let’s make some predictions so that we can validate the model

    y_pred = model.predict(X_test)

    y_pred

    Output:

    Here we can see the prediction made by the model. Now we need to evaluate this model.

    Evaluation

    Evaluation of a classification model can be done in various ways. Since it’s a binary classification model we find that there are two prime methods which can help us in the evaluation. These methods are as follows:

    Accuracy score

It is the universal method for evaluating any classification model; it compares actual and predicted values using the following formula:

accuracy(y, ŷ) = (1 / n_samples) Σ_{i=0}^{n_samples - 1} 1(ŷ_i = y_i)

    Where,

    y = actual values

    ŷ= predicted values

This simply counts the correct predictions and reports the result as a percentage. Let's see how we can calculate it for the above model.

from sklearn.metrics import accuracy_score

print("Accuracy of binary classification : ", accuracy_score(y_test, y_pred) * 100, "%")

    Output:

    Here we can see the accuracy of our model is good enough. Let’s verify it using another evaluation method.

    Confusion matrix

This method tells us how many right decisions were taken by the model. As the name suggests, it is a matrix whose cells hold the following values:

Here, the true positives and true negatives are the values the model predicted correctly, and the other cells are wrong predictions. Let's check how many values our model predicts correctly on the test data.

import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm / np.sum(cm), annot=True, fmt='.1%')

    Output:

    In the above, we can see that 11.7% (7.6 + 4.1) values are not accurately predicted by the model and 88.3% values are correctly predicted. Let’s take a look at how the logistic regression model is being utilised in the real world.

    Application of logistic regression

Logistic regression is used to solve a variety of use-cases in fields such as medicine, politics, engineering and marketing. For a simple example, these models can be used to predict the risk of developing a disease from a person's characteristics, or to predict the mortality of injured humans and animals.

In politics, we can use it to predict how many voters will vote for a party by observing voter demographics. In engineering, the model is used to predict the failure of various components, processes and systems. In marketing, it is used to predict customers' propensity to purchase a product or service from an analysis of customer demographics.

    This model can be extended to perform operations in different domains of AI where sequential data is being collected and analysed like NLP and computer vision.

    Final words

    In this article, we have discussed one of the basic algorithms in machine learning which is logistic regression. There are a variety of use cases where we can find this algorithm most reliable. For instance, it has been utilised for predicting the mortality of injured humans in Trauma and Injury Severity Score (TRISS). Such examples make us believe that for some simple cases we can rely on logistic regression.


    About DSW

    Data Science Wizards (DSW) is an Artificial Intelligence and Data Science start-up that primarily offers platforms, solutions, and services for making use of data as a strategy through AI and data analytics solutions and consulting services to help enterprises in data-driven decisions.

    DSW’s flagship platform UnifyAI is an end-to-end AI-enabled platform for enterprise customers to build, deploy, manage, and publish their AI models. UnifyAI helps you to build your business use case by leveraging AI capabilities and improving analytics outcomes.

Connect with us at contact@datasciencewizards.ai and visit us at darkseagreen-chicken-141904.hostingersite.com/

  • Getting Started with Machine Learning Algorithms: Naive Bayes

    Getting Started with Machine Learning Algorithms: Naive Bayes

In supervised machine learning, the Naive Bayes algorithm is one of the most common algorithms we can use for both binary and multi-class classification tasks. Since it has a wide range of real-life applications, it is crucial to learn the concepts behind it. So in this article, we will give an introductory guide to Naive Bayes using the following major points.

Table of Contents

    • What is Naive Bayes?
    • How does a Naive Bayes algorithm work?
    • Assumptions of Naive Bayes
    • Code example
    • Pros and Cons of Naive Bayes

    What is Naive Bayes?

In the machine learning and data science space, naive Bayes is one of the popular algorithms we use for classification tasks. The idea behind the algorithm is Bayes' theorem from probability theory, named after Reverend Thomas Bayes. According to this theorem, the probability of a hypothesis (in this case, a particular class) is proportional to the probability of the evidence (the input features) given that hypothesis, multiplied by the prior probability of the hypothesis.

In naive Bayes, the word "naive" refers to the assumption that the input features are conditionally independent given the class. The assumption is called naive because it is generally an oversimplification of real-world scenarios, where features can depend on each other. Take text classification, where the words of a document are the input features: under this assumption, the occurrence of one word does not affect the occurrence of other words in the same document, given the class. This is often not true, because the occurrence of certain words in a document usually affects the likelihood of other words appearing as well. Despite this naive assumption, Naive Bayes can still perform well in many real-world applications.

Instead of dwelling more on what naive Bayes is, we can understand it through its working. So let's see how naive Bayes works.

    How does naive Bayes work?

As discussed above, naive Bayes is based on Bayes' theorem, so its working depends on calculating the probability of each possible class given the input features, which is exactly what applying Bayes' theorem gives us. In simple terms, we can write Bayes' theorem with the following notation:

P(class | features) = P(features | class) x P(class) / P(features)

    Where

    • P(class | features) is the posterior probability of the class given the input features.
    • P(features | class) is the likelihood of the input features given the class.
    • P(class) is the prior probability of the class.
    • P(features) is the marginal probability of the evidence (i.e., the input features).

This can be read as: the posterior probability of a class label given the input features equals the likelihood of the evidence given that class, multiplied by the prior probability of the class, divided by the marginal probability of the evidence. The likelihood term is calculated assuming that the input features are conditionally independent given the class, as follows:

    P(features | class) = P(feature_1 | class) x P(feature_2 | class) x … x P(feature_n | class)

    where feature_1, feature_2, …, feature_n are the input features, and P(feature_i | class) is the probability of feature_i given the class.

    By just using the likelihood and prior probabilities, we can simplify the formula for Naive Bayes to:

P(class | features) = normalization factor x P(feature_1 | class) x P(feature_2 | class) x … x P(feature_n | class) x P(class)

    Here the normalization factor is a constant that makes the probabilities sum up to 1, and the P(feature_i | class) and P(class) can be estimated using the training data.

Put simply, to classify a new instance, Naive Bayes calculates the probability of each possible class label given the input features using the formula above, and then selects the class label with the highest probability as the predicted label for that instance.
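To make this concrete, here is a minimal sketch that computes the posterior by hand for Gaussian features on hypothetical toy data and compares it with scikit-learn's GaussianNB (all names and numbers here are illustrative):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# hypothetical toy data: two Gaussian features, two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def gaussian_pdf(x, mean, var):
    # likelihood of each feature value under the per-class Gaussian
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict_proba_by_hand(x_new):
    posteriors = []
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)                                                    # P(class)
        likelihood = np.prod(gaussian_pdf(x_new, Xc.mean(axis=0), Xc.var(axis=0)))  # product of P(feature_i | class)
        posteriors.append(prior * likelihood)
    posteriors = np.array(posteriors)
    return posteriors / posteriors.sum()                                            # normalization factor

x_new = np.array([1.0, 1.2])
print(predict_proba_by_hand(x_new))
print(GaussianNB().fit(X, y).predict_proba([x_new]))  # should closely match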

Going deeper into the subject, there are three major variants of the Naive Bayes algorithm, used for different kinds of data. Basic details about these variants are as follows:

• Gaussian Naive Bayes: This variant is used when the input features are continuous or numerical. It assumes that the input data follows a Gaussian distribution and estimates the mean and variance of each feature for each class. It is widely used in classification problems that involve continuous features, such as predicting the species of a flower from its measurements.
• Multinomial Naive Bayes: This variant is used when the input features are discrete or categorical. It assumes that the input data follows a multinomial distribution and estimates the probabilities of each feature for each class. It is widely used in text classification problems, such as classifying emails as spam or not spam based on their content (see the sketch after this list).
• Bernoulli Naive Bayes: This variant is similar to Multinomial Naive Bayes but is used when the input features are binary or Boolean. It assumes that the input data follows a Bernoulli distribution and estimates the probabilities of each feature being present or absent for each class. This variant is also widely used in text classification problems, such as classifying documents as positive or negative based on the presence or absence of certain words.
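A minimal sketch of the last two variants on a hypothetical toy spam example (the documents and labels below are made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["free offer click now", "meeting agenda attached", "free prize now", "project status meeting"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy labels)

counts = CountVectorizer().fit_transform(docs)                 # word counts -> MultinomialNB
presence = CountVectorizer(binary=True).fit_transform(docs)    # word presence -> BernoulliNB

print(MultinomialNB().fit(counts, labels).predict(counts))
print(BernoulliNB().fit(presence, labels).predict(presence))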

    Now let’s take a look at the assumptions we might need to take care of when choosing Naive Bayes for any data modelling procedure.

    Assumption of Naive Bayes

Here are the important assumptions that we should consider when applying Naive Bayes for data modelling:

    • First of all, the Naive Bayes assumes that the input features are conditionally independent given the class label. So Independence of Features is one of the most important assumptions to cater for naive Bayes. In a more general sense, we can say the presence or absence of one feature does not affect the probability of another feature occurring.
    • As the naive Bayes algorithm treats all input features as equally important in predicting the class label so the Equal Importance of Features becomes the second assumption.
    • When training a naive Bayes model on data, we need to consider Enough Training Data so that it can give a reliable estimation of the probabilities of the input features given the class label.
    • The data we use to model this algorithm should come with the Absence of Redundancy, meaning that the features should not provide redundant or overlapping information about the class label.
    • The training data we are using with a naive Bayes model should have a Balanced Class Distribution. Unbalanced class distribution can lead to inaccuracy of the model, or the model can become biased toward the overrepresented class.

    However, in many cases, it has been seen that this model can still perform well enough if the dependence among the features is not too strong. After knowing about the basics of Naive Bayes, let’s take a look at the code implementation.

    Code Example

In this implementation of Naive Bayes, we will use the Python programming language, where libraries such as sklearn and NumPy give us modules to generate synthetic data, split it and model it. Let's start the implementation by importing the libraries and modules.

    Importing libraries

    import numpy as np

    from sklearn.model_selection import train_test_split

    from sklearn.naive_bayes import GaussianNB

Here, we have imported the NumPy library, which we will use to make synthetic data and do calculations, along with the modules for splitting the data and modelling it with the Gaussian naive Bayes model.

    Generating Data

    Let’s make a dataset

    # Generate random data

    X = np.random.rand(1000, 5)

    y = np.random.randint(0, 2, size=1000)

    Here, we have generated random data with 1000 samples and five features, where the target variable (y) is a binary class label.

    Let’s split the data

    # Split data into training and testing sets

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    Here, we have split the synthetic data into 70% training and 30% testing data.

    Model training

    Now, we can train the Gaussian naive Bayes model using the training data. Let’s make a model object.

    # Create Naive Bayes model

    model = GaussianNB()

    Let’s fit the training data into the Naive Bayes model(above defined model object).

    # Train model on training data

    model.fit(X_train, y_train)

    Output:

    Here, we have trained the model object in the training data. Now we can make predictions and evaluate our model.

    Model Evaluation

    Let’s make a prediction out of our trained model.

    # Make predictions on testing data

    y_pred = model.predict(X_test)

Let's evaluate the model based on the predictions it just made.

# Evaluate model performance
accuracy = np.mean(y_pred == y_test) * 100
print("Accuracy: ", accuracy, '%')

Output:

Here, we can see that our model achieved only about 50% accuracy. That is expected, since the features were generated completely at random with no relationship to the labels; our aim here was simply to walk through the implementation.

In this series of articles, we cover each machine learning algorithm twice: the first article covers the basics of the algorithm, and the second, more advanced article shows how to use it in a way that gets optimal performance out of it.

So, please subscribe to us to learn more advanced ways to model data using different machine learning algorithms. Now let's look at the pros and cons of the Naive Bayes algorithm.

    Pros and cons of Naive Bayes

    There are several advantages and disadvantages of any machine learning algorithm. Similarly, naive Bayes has its own pros and cons. Some of them are listed below:

    Pros

    • Naive Bayes can handle both continuous and categorical data, making it versatile for different types of datasets.
    • The algorithm is less prone to overfitting, which means Naive Bayes can generalize well to new data.
    • Naive Bayes performs well in high-dimensional datasets where the number of data features is larger than the number of data observations.
    • We can use Naive Bayes for both binary and multi-class classification problems.
    • Naive Bayes is relatively easy to implement and can be used as a baseline model for other, more complex algorithms.

    Cons

    • The assumption of all features being independent of each other becomes a con because this is rarely true in real-world datasets.
    • Naive Bayes can be affected by the presence of outliers in the data.
    • Naive Bayes relies heavily on the quality of the input data and can perform poorly in the case of data being noisy or containing missing values.
    • The algorithm can have difficulties handling datasets with rare events, which can lead to underestimation of probabilities.
    • Naive Bayes is a probabilistic algorithm, which means that it can sometimes produce unreliable probabilities for rare events or extreme cases.

    Final words

In the above article, we have discussed the naive Bayes algorithm, one of the popular algorithms in the machine learning space. Looking at the basics, we can say that it is grounded in probability theory. This algorithm can be a good choice when fewer calculations are required or when the features in the dataset are independent of each other or have very low correlation. We have also discussed the assumptions we need to take into account, as well as the pros and cons of this algorithm.

    To know more about different machine learning algorithms, one can subscribe to us. More details about us can be found below.

    About DSW

    DSW, specializing in Artificial Intelligence and Data Science, provides platforms and solutions for leveraging data through AI and advanced analytics. With offices located in Mumbai, India, and Dublin, Ireland, the company serves a broad range of customers across the globe.

    Our mission is to democratize AI and Data Science, empowering customers with informed decision-making. Through fostering the AI ecosystem with data-driven, open-source technology solutions, we aim to benefit businesses, customers, and stakeholders and make AI available for everyone.

    Our flagship platform ‘UnifyAI’ aims to streamline the data engineering process, provide a unified pipeline, and integrate AI capabilities to support businesses in transitioning from experimentation to full-scale production, ultimately enhancing operational efficiency and driving growth.