A Peek into SVM Classification through Iris


SVM Poster

A Support Vector Machine (SVM) is really a powerful Machine Learning Model. It is capable of performing both classification and regression. And Iris flower data set is just perfect for analyzing SVM. So, in this article, we are going to take a peek into the SVM model by using the Iris data set.

So, power up your jupyter notebook and follow along.

Getting the data

First, let’s load the data from the Scikit-Learn library.

from sklearn import datasets

iris = datasets.load_iris

Analyzing the Data

In this section, we will be analyzing the Iris data set and also visualize the data for better understanding. Also, it is a good thing to keep in mind that the data set contains 150 instances.

Looking at the keys that are available with the data:

iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

Looks like we can gather a lot of information about the data set. You can start looking at the relevant keys one by one.

iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

So, we have three flowers as the target names in the data set. The Iris setosa, Iris versicolor, and Iris Virginica. As you must have observed, the target_name is an array.

Further, you can have a look at the `target` feature and how our machine learning model is going to interpret it.

iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...])
iris.target.dtype
dtype('int32')

So basically, we have the three flower names divided into three integer classes (0, 1, 2). There is not much else to take from here.

Now, taking a look into feature_names and data keys:

iris.feature_names
['sepal length (cm)',  'sepal width (cm)',  'petal length (cm)',  'petal width (cm)']

We have four features available. ‘sepal length’, ‘sepal width’, ‘petal length’, and ‘petal width’ are given in cm.

iris.data
array([[5.1, 3.5, 1.4, 0.2],        [4.9, 3. , 1.4, 0.2],        [4.7, 3.2, 1.3, 0.2],        [4.6, 3.1, 1.5, 0.2],        [5. , 3.6, 1.4, 0.2],        [5.4, 3.9, 1.7, 0.4],        [4.6, 3.4, 1.4, 0.3],        [5. , 3.4, 1.5, 0.2], ...

The data key contains a 2D array where each inner array four floating point values for the four features available for the flowers.

There is nothing fancy or very much intriguing about the data set here. The next section, where we will be classifying and visualizing the data is going to much more fun.

Visualizing the Data

Let’s plot the scatter matrices for petal length (cm) vs petal width (cm)and sepal length (cm)vs sepal width (cm). Maybe we can visualize some features that will help us proceed further with the classification.

# this will label the target names for the flowers
names = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])

plt.subplot(122)
plt.scatter(iris.data[:, 2], iris.data[:, 3], c=iris.target)
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])

plt.colorbar(ticks=[0, 1, 2], format=names)
save_fig('feature_plot')
Iris Feature Plots
Plots for the features of the Iris data set.

If you observe closely then you can see that the we are unable to distinguish between the three flowers when taking the plot of sepal length (cm) vs. sepal width (cm). The obvious distinction issue is between Iris Versicolor and Iris Virginica. But then again, the plot of petal length (cm) vs petal width (cm) gives an almost clear distinction between the three flowers. Looks like that should be our choosing criterion.

Classifying using Linear SVC

In this section of the article, we will be using linear SVC classification to classify the Iris data set. We will also check the accuracy to see how well our model performed.

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X = iris['data'][:, (2, 3)]  # Taking only the petal length and petal width
y = iris['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# SVM Classifier model
svm_clf = SVC(kernel='linear', C=1.0)
svm_clf.fit(X_train, y_train) # `C` is the penalty parameter of the error term

Now that we have fit our model, let’s predict on the test set and also check the accuracy:

# Testing and checking the accuracy
from sklearn.metrics import accuracy_score

y_pred = svm_clf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
0.9666666666666667

Wow, we have above 96% accuracy. Not bad for a start.

Conclusion

In this article, you got to know how to fit a linear SVC model over the Iris data set. From here on, you can take many challenging tasks. Try finding some more complex data sets and fit a model using the Multi-class classification. Also, you can try to plot the decision boundary of the above model to get a clearer picture.

If you liked this article, then please share, like and subscribe. Also, follow me on Twitter and Facebook to get regular updates.

Liked it? Take a second to support Sovit Ranjan Rath on Patreon!
Become a patron at Patreon!

Leave a Reply

Your email address will not be published. Required fields are marked *