Decision boundary of label propagation versus SVM on the Iris dataset

https://scikit-learn.org/stable/auto_examples/semi_supervised/plot_label_propagation_versus_svm_iris.html#sphx-glr-auto-examples-semi-supervised-plot-label-propagation-versus-svm-iris-py

此範例是比較藉由Label Propagation(標籤傳播法)及SVM(支持向量機)對iris dataset(鳶尾花卉數據集)生成的decision boundary(決策邊界)

Label Propagation屬於一種Semi-supervised learning(半監督學習)

(一)引入函式庫

numpy : 產生陣列數值
matplotlib.pyplot : 用來繪製影像
sklearn import datasets : 匯入資料集
sklearn import svm : 匯入支持向量機
sklearn.semi_supervised import LabelSpreading : 匯入標籤傳播算法
```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets from sklearn import svm from sklearn.semi_supervised import LabelSpreading

## (二)讀取資料集

* numpy.random.RandomState(seed=None) : 產生隨機數
* datasets.load_iris() : 將資料及存入，iris為一個dict型別資料
* X代表從iris資料內讀取前兩項數據，分別表示萼片的長度及寬度
* y代表iris所屬的class
* np.copy() : 複製數據進行操作，避免修改原本的檔案
* [rng.rand(len(y)) < 0.3] = -1 代表隨機生成150個0~1區間的值，並將小於0.3的值轉為-1 (len(y)為150)
```python
rng = np.random.RandomState(0)
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
h = .02 # 設定用於mesh的step
y_30 = np.copy(y)
y_30[rng.rand(len(y)) < 0.3] = -1
y_50 = np.copy(y)
y_50[rng.rand(len(y)) < 0.5] = -1

LabelSpreading().fit() : 進行標籤傳播法並擬合數據集
svm.SVC().fit() : 進行SVC(support vectors classification)並擬合數據集

分別用不同比例已被標籤的數據和未被標籤的數據進行標籤傳播，並與SVM的結果進行對比

# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
ls30 = (LabelSpreading().fit(X, y_30), y_30)
ls50 = (LabelSpreading().fit(X, y_50), y_50)
ls100 = (LabelSpreading().fit(X, y), y)
rbf_svc = (svm.SVC(kernel='rbf', gamma=.5).fit(X, y), y)

(三)繪製比較圖

min()、max() : 決定x與y的範圍
np.meshgrid() : 從給定的座標向量回傳座標矩陣

這裡分別是以x,y的最大、最小值加減1並以h=0.02的間隔來繪製

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                   np.arange(y_min, y_max, h))

定義各圖片標題

titles = ['Label Spreading 30% data',
        'Label Spreading 50% data',
        'Label Spreading 100% data',
        'SVC with rbf kernel']

為了繪製圖片，設定一個為dict型態的color_map，將4種label分別給予不同顏色

最後用下面的程式將所有點繪製出來

color_map = {-1: (1, 1, 1), 0: (0, 0, .9), 1: (1, 0, 0), 2: (.8, .6, 0)}

for i, (clf, y_train) in enumerate((ls30, ls50, ls100, rbf_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
    plt.axis('off')

    # Plot also the training points
    colors = [color_map[y] for y in y_train]
    plt.scatter(X[:, 0], X[:, 1], c=colors, edgecolors='black')

    plt.title(titles[i])

plt.suptitle("Unlabeled points are colored white", y=0.1)
plt.show()

顯示了即使只有少部分被標籤的數據，Label Propagation也能很好的學習產生decision boundary
(四)完整程式碼

https://scikit-learn.org/stable/_downloads/97a366ef6b2f7394d4eb409814bf4842/plot_label_propagation_versus_svm_iris.py

print(__doc__)

# Authors: Clay Woolam <clay@woolam.org>
# License: BSD

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
from sklearn.semi_supervised import LabelSpreading

rng = np.random.RandomState(0)

iris = datasets.load_iris()

X = iris.data[:, :2]
y = iris.target

# step size in the mesh
h = .02

y_30 = np.copy(y)
y_30[rng.rand(len(y)) < 0.3] = -1
y_50 = np.copy(y)
y_50[rng.rand(len(y)) < 0.5] = -1
# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
ls30 = (LabelSpreading().fit(X, y_30), y_30)
ls50 = (LabelSpreading().fit(X, y_50), y_50)
ls100 = (LabelSpreading().fit(X, y), y)
rbf_svc = (svm.SVC(kernel='rbf', gamma=.5).fit(X, y), y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['Label Spreading 30% data',
          'Label Spreading 50% data',
          'Label Spreading 100% data',
          'SVC with rbf kernel']

color_map = {-1: (1, 1, 1), 0: (0, 0, .9), 1: (1, 0, 0), 2: (.8, .6, 0)}

for i, (clf, y_train) in enumerate((ls30, ls50, ls100, rbf_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
    plt.axis('off')

    # Plot also the training points
    colors = [color_map[y] for y in y_train]
    plt.scatter(X[:, 0], X[:, 1], c=colors, edgecolors='black')

    plt.title(titles[i])

plt.suptitle("Unlabeled points are colored white", y=0.1)
plt.show()

PreviousEx 4: Label Propagation digits active learning NextEnsemble_methods

Last updated 4 years ago