# Ex 4: Label Propagation digits active learning

## Semi-supervised classification / Example 4: Label Propagation digits active learning

Goal of this example:

* Show how active learning can be combined with label propagation to learn to recognize handwritten digits

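## 1. Active Learning

In practice, a large share of the data we collect is unlabeled. To apply a standard classifier, the most direct approach is to label all of it, but labeling every sample one by one is time-consuming and labor-intensive. When unlabeled data far outnumbers labeled data, active learning can be used to actively choose which samples to label. Active learning has two parts:

* Take a small set of labeled samples as the training set and train a classification model
* Iterate: use the classifier's predictions to choose which samples to label next, then retrain with the newly labeled samples.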
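The two parts above can be sketched as a small query-and-retrain loop. This is only an illustration under stated assumptions: `fit_predict_proba` is a hypothetical nearest-centroid stand-in, not the LabelSpreading model used later on this page, and the data are synthetic 2-D blobs.

```python
import numpy as np

def fit_predict_proba(X, y_train):
    """Hypothetical toy model: softmax over negative squared distance
    to the centroid of each labeled class (-1 marks unlabeled)."""
    classes = np.unique(y_train[y_train != -1])
    centroids = np.stack([X[y_train == c].mean(axis=0) for c in classes])
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    proba = np.exp(logits)
    return proba / proba.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)

labeled = np.zeros(len(y), dtype=bool)
labeled[[0, 1, 50, 51]] = True  # start with two labels per class

history = []
for _ in range(3):
    y_train = np.where(labeled, y, -1)       # hide unknown labels
    proba = fit_predict_proba(X, y_train)    # part 1: train and predict
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    entropy[labeled] = -np.inf               # never re-query labeled samples
    query = np.argsort(entropy)[-5:]         # part 2: 5 most uncertain samples
    labeled[query] = True                    # the "oracle" reveals their labels
    history.append(int(labeled.sum()))

print(history)  # [9, 14, 19]
```

The labeled pool grows by exactly five samples per round, which mirrors the 40, 45, 50, ... progression reported in the output further down this page.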

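## 2. Importing functions and models

* stats (from scipy) provides statistical functions; here it is used to compute the entropy of the predicted label distributions
* LabelSpreading is the semi-supervised learning model
* confusion\_matrix computes the confusion matrix
* classification\_report summarizes how the predictions differ from the true values, reporting precision, recall, f1-score, and support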

```python
import numpy as np
import matplotlib.pyplot as plt

from scipy import stats
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix
```
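As a quick illustration of the two metric helpers imported above, here is a minimal sketch on made-up labels (the `y_true` and `y_pred` values are invented for this example and are not taken from the digits data).

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels, invented for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Per-class precision, recall, f1-score and support as a dictionary
report = classification_report(y_true, y_pred, output_dict=True)
print(report["1"]["recall"])  # 1.0
```

Reading the matrix row by row shows where each true class ends up, which is how the larger confusion matrices in the output below can be interpreted.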

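## 3. Building the dataset

* The dataset comes from sklearn.datasets.load\_digits and contains handwritten digits 0\~9, 1797 samples in total
* 330 of these samples are used for training (y\_train): 40 are labeled and the remaining 290 are unlabeled (marked as -1)
* The number of iterations is set to 5
* The description of this example on the scikit-learn website says 10 labeled points, but the source code uses 40, so this walkthrough follows the source code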

```python
digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:330]]
y = digits.target[indices[:330]]
images = digits.images[indices[:330]]

n_total_samples = len(y)
n_labeled_points = 40
max_iterations = 5

unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]
f = plt.figure()
```
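The way unlabeled samples are marked with -1 can be seen on a tiny array. This is a sketch with invented labels; only the names `n_labeled_points`, `unlabeled_indices`, and `y_train` follow the code on this page.

```python
import numpy as np

# Invented labels standing in for digits.target
y = np.array([3, 1, 4, 1, 5, 9, 2, 6])
n_labeled_points = 3

# Everything after the first n_labeled_points entries is treated as unlabeled
unlabeled_indices = np.arange(len(y))[n_labeled_points:]

# Copy the true labels, then hide the unlabeled ones behind -1
y_train = np.copy(y)
y_train[unlabeled_indices] = -1

print(y_train)  # [ 3  1  4 -1 -1 -1 -1 -1]
```

The original `y` is left untouched, so the true labels remain available for evaluation and for revealing queried samples later.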

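## 4. Training and predicting with active learning

* The code below is what happens in each iteration (the body of the for loop)
* In each iteration the fitted model predicts labels for the unlabeled samples, giving predicted\_labels, which are compared with true\_labels to compute the confusion matrix and the classification report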

```python
if len(unlabeled_indices) == 0:
    print("No unlabeled items left to label.")
    break
y_train = np.copy(y)
y_train[unlabeled_indices] = -1

lp_model = LabelSpreading(gamma=0.25, max_iter=20)
lp_model.fit(X, y_train)

predicted_labels = lp_model.transduction_[unlabeled_indices]
true_labels = y[unlabeled_indices]

cm = confusion_matrix(true_labels, predicted_labels,
                      labels=lp_model.classes_)

print("Iteration %i %s" % (i, 70 * "_"))
print("Label Spreading model: %d labeled & %d unlabeled (%d total)"
      % (n_labeled_points, n_total_samples - n_labeled_points,
         n_total_samples))

print(classification_report(true_labels, predicted_labels))

print("Confusion matrix")
print(cm)
```

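* stats is used to compute the entropy of each sample's predicted label distribution. The 5 unlabeled samples with the highest entropy, that is, the ones the model is least certain about, are selected, and their predicted label, true label, and image are displayed
* At the end of each iteration these 5 samples are removed from unlabeled\_indices, so in the next iteration they keep their true labels in the training labels y\_train, while all remaining samples (those after the first 40 that have not been queried) are still set to -1, meaning unlabeled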

```python
# compute the entropies of transduced label distributions
pred_entropies = stats.distributions.entropy(
    lp_model.label_distributions_.T)

# select up to 5 digit examples that the classifier is most uncertain about
uncertainty_index = np.argsort(pred_entropies)[::-1]
uncertainty_index = uncertainty_index[
    np.isin(uncertainty_index, unlabeled_indices)][:5]

# keep track of indices that we get labels for
delete_indices = np.array([], dtype=int)

# for more than 5 iterations, visualize the gain only on the first 5
if i < 5:
    f.text(.05, (1 - (i + 1) * .183),
           "model %d\n\nfit with\n%d labels" %
           ((i + 1), i * 5 + 40), size=10)
for index, image_index in enumerate(uncertainty_index):
    image = images[image_index]

    # for more than 5 iterations, visualize the gain only on the first 5
    if i < 5:
        sub = f.add_subplot(5, 5, index + 1 + (5 * i))
        sub.imshow(image, cmap=plt.cm.gray_r, interpolation='none')
        sub.set_title("predict: %i\ntrue: %i" % (
            lp_model.transduction_[image_index], y[image_index]), size=10)
        sub.axis('off')

    # labeling 5 points, remote from labeled set
    delete_index, = np.where(unlabeled_indices == image_index)
    delete_indices = np.concatenate((delete_indices, delete_index))

unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
n_labeled_points += len(uncertainty_index)
```
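The selection step can be checked by hand on a small, made-up label distribution. This is a sketch only: the probabilities below are invented, and `scipy.stats.entropy` is used as the public entry point for the entropy computation.

```python
import numpy as np
from scipy import stats

# Invented class probabilities for 4 samples and 3 classes
label_distributions = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.34, 0.33, 0.33],   # almost uniform, most uncertain
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
])
unlabeled_indices = np.array([1, 2, 3])  # sample 0 is already labeled

# entropy() works column-wise, hence the transpose: one value per sample
pred_entropies = stats.entropy(label_distributions.T)

# Sort from most to least uncertain, keep unlabeled samples, take the top 2
order = np.argsort(pred_entropies)[::-1]
order = order[np.isin(order, unlabeled_indices)][:2]

print(order)  # [1 2]
```

High entropy means the probability mass is spread over several classes, so these are the samples where a true label is expected to help the model most.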

* The following code sits outside the for loop

```python
f.suptitle("Active learning with Label Propagation.\nRows show 5 most "
           "uncertain labels to learn with the next model.", y=1.15)
plt.subplots_adjust(left=0.2, bottom=0.03, right=0.9, top=0.9, wspace=0.2,
                    hspace=0.85)
plt.show()
```

* The results of each iteration are shown below. The micro avg rises after every iteration, from 0.87 at iteration 0 to 0.96 at iteration 4

Out:

```
  Iteration 0 ______________________________________________________________________
  Label Spreading model: 40 labeled & 290 unlabeled (330 total)
                precision    recall  f1-score   support

             0       1.00      1.00      1.00        22
             1       0.78      0.69      0.73        26
             2       0.93      0.93      0.93        29
             3       1.00      0.89      0.94        27
             4       0.92      0.96      0.94        23
             5       0.96      0.70      0.81        33
             6       0.97      0.97      0.97        35
             7       0.94      0.91      0.92        33
             8       0.62      0.89      0.74        28
             9       0.73      0.79      0.76        34

     micro avg       0.87      0.87      0.87       290
     macro avg       0.89      0.87      0.87       290
  weighted avg       0.88      0.87      0.87       290

  Confusion matrix
  [[22  0  0  0  0  0  0  0  0  0]
   [ 0 18  2  0  0  0  1  0  5  0]
   [ 0  0 27  0  0  0  0  0  2  0]
   [ 0  0  0 24  0  0  0  0  3  0]
   [ 0  1  0  0 22  0  0  0  0  0]
   [ 0  0  0  0  0 23  0  0  0 10]
   [ 0  1  0  0  0  0 34  0  0  0]
   [ 0  0  0  0  0  0  0 30  3  0]
   [ 0  3  0  0  0  0  0  0 25  0]
   [ 0  0  0  0  2  1  0  2  2 27]]
  Iteration 1 ______________________________________________________________________
  Label Spreading model: 45 labeled & 285 unlabeled (330 total)
                precision    recall  f1-score   support

             0       1.00      1.00      1.00        22
             1       0.79      1.00      0.88        22
             2       1.00      0.93      0.96        29
             3       1.00      1.00      1.00        26
             4       0.92      0.96      0.94        23
             5       0.96      0.70      0.81        33
             6       1.00      0.97      0.99        35
             7       0.94      0.91      0.92        33
             8       0.77      0.86      0.81        28
             9       0.73      0.79      0.76        34

     micro avg       0.90      0.90      0.90       285
     macro avg       0.91      0.91      0.91       285
  weighted avg       0.91      0.90      0.90       285

  Confusion matrix
  [[22  0  0  0  0  0  0  0  0  0]
   [ 0 22  0  0  0  0  0  0  0  0]
   [ 0  0 27  0  0  0  0  0  2  0]
   [ 0  0  0 26  0  0  0  0  0  0]
   [ 0  1  0  0 22  0  0  0  0  0]
   [ 0  0  0  0  0 23  0  0  0 10]
   [ 0  1  0  0  0  0 34  0  0  0]
   [ 0  0  0  0  0  0  0 30  3  0]
   [ 0  4  0  0  0  0  0  0 24  0]
   [ 0  0  0  0  2  1  0  2  2 27]]
  Iteration 2 ______________________________________________________________________
  Label Spreading model: 50 labeled & 280 unlabeled (330 total)
                precision    recall  f1-score   support

             0       1.00      1.00      1.00        22
             1       0.85      1.00      0.92        22
             2       1.00      1.00      1.00        28
             3       1.00      1.00      1.00        26
             4       0.87      1.00      0.93        20
             5       0.96      0.70      0.81        33
             6       1.00      0.97      0.99        35
             7       0.94      1.00      0.97        32
             8       0.92      0.86      0.89        28
             9       0.73      0.79      0.76        34

     micro avg       0.92      0.92      0.92       280
     macro avg       0.93      0.93      0.93       280
  weighted avg       0.93      0.92      0.92       280

  Confusion matrix
  [[22  0  0  0  0  0  0  0  0  0]
   [ 0 22  0  0  0  0  0  0  0  0]
   [ 0  0 28  0  0  0  0  0  0  0]
   [ 0  0  0 26  0  0  0  0  0  0]
   [ 0  0  0  0 20  0  0  0  0  0]
   [ 0  0  0  0  0 23  0  0  0 10]
   [ 0  1  0  0  0  0 34  0  0  0]
   [ 0  0  0  0  0  0  0 32  0  0]
   [ 0  3  0  0  1  0  0  0 24  0]
   [ 0  0  0  0  2  1  0  2  2 27]]
  Iteration 3 ______________________________________________________________________
  Label Spreading model: 55 labeled & 275 unlabeled (330 total)
                precision    recall  f1-score   support

             0       1.00      1.00      1.00        22
             1       0.85      1.00      0.92        22
             2       1.00      1.00      1.00        27
             3       1.00      1.00      1.00        26
             4       0.87      1.00      0.93        20
             5       0.96      0.87      0.92        31
             6       1.00      0.97      0.99        35
             7       1.00      1.00      1.00        31
             8       0.92      0.86      0.89        28
             9       0.88      0.85      0.86        33

     micro avg       0.95      0.95      0.95       275
     macro avg       0.95      0.95      0.95       275
  weighted avg       0.95      0.95      0.95       275

  Confusion matrix
  [[22  0  0  0  0  0  0  0  0  0]
   [ 0 22  0  0  0  0  0  0  0  0]
   [ 0  0 27  0  0  0  0  0  0  0]
   [ 0  0  0 26  0  0  0  0  0  0]
   [ 0  0  0  0 20  0  0  0  0  0]
   [ 0  0  0  0  0 27  0  0  0  4]
   [ 0  1  0  0  0  0 34  0  0  0]
   [ 0  0  0  0  0  0  0 31  0  0]
   [ 0  3  0  0  1  0  0  0 24  0]
   [ 0  0  0  0  2  1  0  0  2 28]]
  Iteration 4 ______________________________________________________________________
  Label Spreading model: 60 labeled & 270 unlabeled (330 total)
                precision    recall  f1-score   support

             0       1.00      1.00      1.00        22
             1       0.96      1.00      0.98        22
             2       1.00      0.96      0.98        27
             3       0.96      1.00      0.98        25
             4       0.86      1.00      0.93        19
             5       0.96      0.87      0.92        31
             6       1.00      0.97      0.99        35
             7       1.00      1.00      1.00        31
             8       0.92      0.96      0.94        25
             9       0.88      0.85      0.86        33

     micro avg       0.96      0.96      0.96       270
     macro avg       0.95      0.96      0.96       270
  weighted avg       0.96      0.96      0.96       270

  Confusion matrix
  [[22  0  0  0  0  0  0  0  0  0]
   [ 0 22  0  0  0  0  0  0  0  0]
   [ 0  0 26  1  0  0  0  0  0  0]
   [ 0  0  0 25  0  0  0  0  0  0]
   [ 0  0  0  0 19  0  0  0  0  0]
   [ 0  0  0  0  0 27  0  0  0  4]
   [ 0  1  0  0  0  0 34  0  0  0]
   [ 0  0  0  0  0  0  0 31  0  0]
   [ 0  0  0  0  1  0  0  0 24  0]
   [ 0  0  0  0  2  1  0  0  2 28]]
```
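For a single-label multiclass problem like this one, the micro-averaged precision, recall, and f1-score all equal the plain accuracy, so the micro avg row can be recomputed from the confusion matrix. The sketch below copies the Iteration 0 matrix from the output above.

```python
import numpy as np

# Confusion matrix of Iteration 0, copied from the output above
cm = np.array([
    [22,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0, 18,  2,  0,  0,  0,  1,  0,  5,  0],
    [ 0,  0, 27,  0,  0,  0,  0,  0,  2,  0],
    [ 0,  0,  0, 24,  0,  0,  0,  0,  3,  0],
    [ 0,  1,  0,  0, 22,  0,  0,  0,  0,  0],
    [ 0,  0,  0,  0,  0, 23,  0,  0,  0, 10],
    [ 0,  1,  0,  0,  0,  0, 34,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  0,  0, 30,  3,  0],
    [ 0,  3,  0,  0,  0,  0,  0,  0, 25,  0],
    [ 0,  0,  0,  0,  2,  1,  0,  2,  2, 27],
])

correct = int(np.trace(cm))  # diagonal entries are correct predictions
total = int(cm.sum())        # all evaluated (unlabeled) samples
accuracy = correct / total

print(correct, total, round(accuracy, 2))  # 252 290 0.87
```

The same calculation on the Iteration 4 matrix gives 258 correct out of 270, which rounds to the reported 0.96.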

![png](/files/-M-n7jvCNqGFxkvn__Dd)

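The figure above shows the active learning process. The first iteration trains on 330 samples, of which 40 are labeled and 290 are unlabeled, and then predicts the unlabeled samples. The 5 predictions the model is least certain about are displayed as the 5 images in the first row. These 5 samples are then assigned their true labels, so in the next iteration there are 45 labeled and 285 unlabeled samples, still 330 in total, for the second round of training, and so on. Each round of training therefore adds 5 more labeled samples.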
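The labeled and unlabeled counts described above follow a simple arithmetic pattern, sketched here. The name `batch_size` is introduced only for this illustration; the code on this page hard-codes the value 5.

```python
n_total_samples = 330
n_labeled_points = 40
batch_size = 5  # samples queried per iteration (illustrative name)

# (labeled, unlabeled) counts at the start of each of the 5 iterations
schedule = [
    (n_labeled_points + batch_size * i,
     n_total_samples - n_labeled_points - batch_size * i)
    for i in range(5)
]

print(schedule)
# [(40, 290), (45, 285), (50, 280), (55, 275), (60, 270)]
```

These pairs match the "labeled & unlabeled" line printed at the top of each iteration in the output.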

## 5. Source code listing

Python source code: plot\_label\_propagation\_digits\_active\_learning.py

<https://scikit-learn.org/stable/auto_examples/semi_supervised/plot_label_propagation_digits_active_learning.html>

```python
print(__doc__)

# Authors: Clay Woolam <clay@woolam.org>
# License: BSD

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix

digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:330]]
y = digits.target[indices[:330]]
images = digits.images[indices[:330]]

n_total_samples = len(y)
n_labeled_points = 40
max_iterations = 5

unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]
f = plt.figure()

for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        print("No unlabeled items left to label.")
        break
    y_train = np.copy(y)
    y_train[unlabeled_indices] = -1

    lp_model = LabelSpreading(gamma=0.25, max_iter=20)
    lp_model.fit(X, y_train)

    predicted_labels = lp_model.transduction_[unlabeled_indices]
    true_labels = y[unlabeled_indices]

    cm = confusion_matrix(true_labels, predicted_labels,
                          labels=lp_model.classes_)

    print("Iteration %i %s" % (i, 70 * "_"))
    print("Label Spreading model: %d labeled & %d unlabeled (%d total)"
          % (n_labeled_points, n_total_samples - n_labeled_points,
             n_total_samples))

    print(classification_report(true_labels, predicted_labels))

    print("Confusion matrix")
    print(cm)

    # compute the entropies of transduced label distributions
    pred_entropies = stats.distributions.entropy(
        lp_model.label_distributions_.T)

    # select up to 5 digit examples that the classifier is most uncertain about
    uncertainty_index = np.argsort(pred_entropies)[::-1]
    uncertainty_index = uncertainty_index[
        np.isin(uncertainty_index, unlabeled_indices)][:5]

    # keep track of indices that we get labels for
    delete_indices = np.array([], dtype=int)

    # for more than 5 iterations, visualize the gain only on the first 5
    if i < 5:
        f.text(.05, (1 - (i + 1) * .183),
               "model %d\n\nfit with\n%d labels" %
               ((i + 1), i * 5 + 40), size=10)
    for index, image_index in enumerate(uncertainty_index):
        image = images[image_index]

        # for more than 5 iterations, visualize the gain only on the first 5
        if i < 5:
            sub = f.add_subplot(5, 5, index + 1 + (5 * i))
            sub.imshow(image, cmap=plt.cm.gray_r, interpolation='none')
            sub.set_title("predict: %i\ntrue: %i" % (
                lp_model.transduction_[image_index], y[image_index]), size=10)
            sub.axis('off')

        # labeling 5 points, remote from labeled set
        delete_index, = np.where(unlabeled_indices == image_index)
        delete_indices = np.concatenate((delete_indices, delete_index))

    unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
    n_labeled_points += len(uncertainty_index)

f.suptitle("Active learning with Label Propagation.\nRows show 5 most "
           "uncertain labels to learn with the next model.", y=1.15)
plt.subplots_adjust(left=0.2, bottom=0.03, right=0.9, top=0.9, wspace=0.2,
                    hspace=0.85)
plt.show()
```


