# Machine Learning Basics: Pocket Learning Algorithm and Basic Feature Engineering

[Updated: February 12, 2018]

In the previous article, Machine Learning Basics and Perceptron Learning Algorithm, the assumption was that the Iris Data Set trained by the Perceptron Learning Algorithm is linearly separable, so the number of misclassifications in each training iteration eventually converges to 0. Obviously, this is not always the case in the real world. Therefore, two questions need to be asked: what happens if a data set is not linearly separable? How do we handle that scenario?

The previous article also shows that, at least, versicolor and virginica are not linearly separable. (See the picture below.)

(Source code can be found at iris_example.py)

If we use the Perceptron Learning Algorithm to train on these two types of iris plants, the weight update function, i.e., the loop in the training function, never stops. The iterations end only because we set a limit on the number of iterations, not because the number of misclassifications converges to zero. In addition, the output of the trained model is not stable. In other words, the output does not seem to be optimal when the iteration limit is reached.

If we run the Perceptron Learning Algorithm on the Iris Data Set with both versicolor and virginica, the trend of the number of misclassifications looks like the picture below.

(Source code can be found at iris_example.py)

The picture shows that the number of misclassifications jumps up and down.

Note: a minimal in-sample error does not guarantee that the learning model performs equally well out of sample. If the in-sample error is low but the out-of-sample error is high, it is called overfitting, which is not the main topic of this article.

In short, when we use a linear model, like the Perceptron Learning Algorithm, on a non-separable data set, the following things may happen:

• The update function never stops.
• The final output does not guarantee that the in-sample error is optimal.

One approach to approximating the optimal in-sample error is to extend the Perceptron Learning Algorithm with a simple modification, the so-called Pocket Learning Algorithm.

# Pocket Learning Algorithm

The idea is straightforward: this algorithm keeps the best result seen so far in its pocket (that is why it is called the Pocket Learning Algorithm). The best result means the weights with the minimum number of misclassifications. If the new weights produce a smaller number of misclassifications than the weights in the pocket, replace the weights in the pocket with the new weights; if the new weights are not better than the ones in the pocket, keep the ones in the pocket and discard the new weights. At the end of the training iterations, the algorithm returns the solution in the pocket, rather than the last solution.

## The Learning Steps

1. Initialize the pocket weight vector, $W_{pocket}$, to 0 or small random numbers and use this weight vector as the initial weight vector, $W_0$, of the Perceptron Learning Algorithm.
2. For each training iteration, perform the following sub-steps:
   1. Run the training step of the Perceptron Learning Algorithm to obtain the updated weight vector, $W_t$, where $t$ indicates the current iteration.
   2. Evaluate $W_t$ by comparing its number of misclassifications on the entire sample set with the number of misclassifications produced by $W_{pocket}$.
   3. If $W_t$ is better than $W_{pocket}$, replace $W_{pocket}$ with $W_t$.
3. Return $W_{pocket}$ when the training iterations terminate.

The Pocket Learning Algorithm can be implemented simply by extending the Perceptron Learning Algorithm.

```python
import numpy as np
import pandas as pd

class Pocket:
    '''The Pocket class keeps the best weights seen so far in the learning process.

    Parameters
    ----------
    number_of_attributes: int
        The number of attributes of the data set.

    Attributes
    ----------
    best_weights: list of float
        The list of the best weights seen so far.

    misclassify_count: int
        The number of misclassifications corresponding to the best weights.
    '''
    def __init__(self, number_of_attributes):
        self.best_weights = np.zeros(number_of_attributes + 1)
        self.misclassify_count = -1  # -1 means the class is initialized but does not have a valid value

class PocketClassifier:
    '''Pocket Binary Classifier uses a modified Perceptron Learning Algorithm, called
    the Pocket Learning Algorithm, to classify two-class data.

    Parameters
    ----------
    number_of_attributes : int
        The number of attributes of the data set.

    class_labels : tuple of the class labels
        The class labels can be anything as long as there are only two types of labels.

    Attributes
    ----------
    pocket: Pocket
        The pocket contains the best training result so far and
        the number of misclassified samples according to the result in the pocket.

    weights: list of float
        The list of weights corresponding to the input attributes.

    misclassify_record: list of int
        The number of misclassifications for each training iteration.
    '''
    def __init__(self, number_of_attributes: int, class_labels: ()):
        # Initialize the Pocket class
        self.pocket = Pocket(number_of_attributes)
        # Initialize the weights to zero
        # The size is the number of attributes plus the bias, i.e. x_0 * w_0
        self.weights = np.zeros(number_of_attributes + 1)

        # Record the number of misclassifications for each training iteration
        self.misclassify_record = []

        # Build the label maps between the original labels and the numerical labels
        # For example, ('a', 'b') -> {1: 'a', -1: 'b'}
        self._label_map = {1: class_labels[0], -1: class_labels[1]}
        self._reversed_label_map = {class_labels[0]: 1, class_labels[1]: -1}

    def _linear_combination(self, sample):
        '''linear combination of sample and weights'''
        return np.inner(sample, self.weights[1:])

    def train(self, samples, labels, max_iterator=10):
        '''Train the model

        Parameters
        ----------
        samples: two-dimensional list
            Training data set
        labels: list of labels
            The class labels of the training data
        max_iterator: int
            The maximum number of iterations before stopping the training process.
        '''
        # Transfer the labels to numerical labels
        transferred_labels = [self._reversed_label_map[index] for index in labels]

        for _ in range(max_iterator):
            misclassifies = 0
            for sample, target in zip(samples, transferred_labels):
                linear_combination = self._linear_combination(sample)
                update = target - np.where(linear_combination >= 0.0, 1, -1)

                # use numpy.multiply to multiply element-wise
                self.weights[1:] += np.multiply(update, sample)
                self.weights[0] += update

                # record the number of misclassifications
                misclassifies += int(update != 0.0)

            # Update the pocket if the result is better than the one in the pocket
            if (self.pocket.misclassify_count == -1) or \
               (self.pocket.misclassify_count > misclassifies) or (misclassifies == 0):
                # Copy the weights; otherwise the pocket would only hold a reference
                # to self.weights and be silently overwritten by later updates.
                self.pocket.best_weights = np.copy(self.weights)
                self.pocket.misclassify_count = misclassifies

            if misclassifies == 0:
                break

            self.misclassify_record.append(self.pocket.misclassify_count)

    def classify(self, new_data):
        '''Classify the samples based on the trained weights

        Parameters
        ----------
        new_data: two-dimensional list
            New data to be classified

        Return
        ------
        List of int
            The list of predicted class labels.
        '''
        predicted_result = np.where((self._linear_combination(new_data) + self.weights[0]) >= 0.0, 1, -1)
        return [self._label_map[item] for item in predicted_result]
```

(Complete source code can be found at pocket_classifier.py)

Note that this algorithm is straightforward, but not perfect: the result may look somewhat stochastic, because the returned solution is merely the best one seen so far.

# Apply Pocket Learning Algorithm onto Japanese Credit Screening Data Set

It is always easy to illustrate an idea with an example. This section uses a data set from the University of California, Irvine, Machine Learning Repository. (Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.) This data set, the Japanese Credit Screening Data Set, has 690 samples and two classes, positive and negative, of people who were and were not granted credit. Each sample has 15 features; some features may be missing in some samples. Besides, the feature types include categorical, real, and integer. This example demonstrates how to use the Japanese Credit Screening Data Set to train the Pocket Learning Algorithm so that the learned model can be used to determine whether a future, unseen applicant should be approved.

However, most machine learning algorithms have a strong assumption that the data set is numerical, and the Pocket Learning Algorithm is no exception. For the Pocket Learning Algorithm to work on the Japanese Credit Screening Data Set, we need to convert the non-numerical features into numerical ones. This type of process is usually called feature engineering. The next section provides a brief introduction to feature engineering.

# Basic Feature Engineering

The quality of the data and the amount of useful information it contains are key factors that determine how well a machine learning algorithm can learn. In other words, using raw data sets directly can negatively affect the performance of a learning process. Therefore, feature engineering is typically the first step in a machine learning process.

## Training and Test Sets

Overfitting is a common pitfall of machine learning processes; it can happen when a learning process achieves a low (even zero) in-sample error but a huge out-of-sample error on yet-unseen data. To avoid overfitting, a common practice is, instead of using the entire data set as the training set, to hold out part of the data as a test set and use only the rest of the data to train the model. Once the model is trained, we then use the test set to verify the model's performance.

In the example demonstrated in the article Machine Learning Basics and Perceptron Learning Algorithm, we manually separated the Iris Data Set into a training set and a test set by taking the first 10 samples from each class and aggregating them into a test set; the rest of the samples formed the training set. In fact, this is not what we usually do (nor what we should do) when splitting a data set. If a data set is affected at any step of the learning process, its ability to assess the outcome is compromised. This issue is called data snooping, and manually manipulating the data set may result in data snooping bias. (Data snooping bias is outside the scope of this article and will be discussed in future articles.)

The appropriate way to create a training set and a test set is to pick the test set randomly while meeting the following criteria:

• Both the training and the test sets must reflect the original distribution.
• The original data set must be randomly shuffled.

Scikit-learn, an open source machine learning library, provides a nice and convenient function to split a data set into a training set and a test set. The following example demonstrates how to use scikit-learn's split function on the Iris Data Set along with the Perceptron Learning Algorithm.

```python
import pandas
import urllib.request
from perceptron_classifier import PerceptronClassifier

# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
urllib.request.urlretrieve(url, 'iris.data')
# Read the downloaded data into a DataFrame; the file has no header row
IRIS_DATA = pandas.read_csv('iris.data', header=None)

# Try only versicolor and virginica
LABELS = IRIS_DATA.iloc[50:150, 4].values
DATA = IRIS_DATA.iloc[50:150, [0, 2]].values

# Use scikit-learn's train_test_split function to separate the Iris Data Set
# into a training subset (75% of the data) and a test subset (25% of the data)
DATA_TRAIN, DATA_TEST, LABELS_TRAIN, LABELS_TEST = train_test_split(DATA, LABELS, test_size=0.25, random_state=1000)

perceptron_classifier = PerceptronClassifier(2, ('Iris-versicolor', 'Iris-virginica'))
perceptron_classifier.train(DATA_TRAIN, LABELS_TRAIN, 100)
result = perceptron_classifier.classify(DATA_TEST)
```

(The complete example can be found at iris_training_and_test_example.py)

## Categorical Data

Once again, most machine learning algorithms only accept numerical data. To train on samples containing categorical data, we need a process to encode the categorical data as numerical data. Categorical data can be distinguished as ordinal or nominal. An ordinal number is a number that can be sorted or ordered; in other words, ordinals indicate rank and position. For instance, education level could be ordinal: graduate school, college, high school, which can be ranked as graduate school > college > high school. In contrast, a nominal number is a numeric symbol used for labeling only. The values do not imply any order, quantity, or other measurement. For example, marital status: single, married, divorced, widowed. It does not make sense to say that single is higher or larger (by any measurement) than married.

### Encoding Ordinal Data

To make sure a machine learning algorithm interprets ordinal data appropriately, we need to transform the categorical data into numerical data while preserving the sense of order or rank. Since encoding ordinal data with the proper positions requires knowledge of the data, there is no convenient function to automatically convert the categorical data to numerical data with the correct order. Thus, we need to define the mapping manually based on our understanding of the data.

Take the education level as an example; the mapping could be graduate school = college + 1 = high school + 2.
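Because the mapping is defined by hand, applying it can be as simple as pandas' `Series.map`. The sketch below uses the education-level example; the feature values and their ranking are illustrative only, not part of any data set used in this article:

```python
import pandas as pd

# A hypothetical ordinal feature (illustrative only).
education = pd.Series(['high school', 'college', 'graduate school', 'college'])

# Define the order manually: graduate school = college + 1 = high school + 2
education_map = {'high school': 0, 'college': 1, 'graduate school': 2}
education_encoded = education.map(education_map)
print(education_encoded.tolist())  # [0, 1, 2, 1]
```

The key point is that the rank comes from domain knowledge encoded in the dictionary, not from the data itself.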

### Encoding Nominal Data

One common mistake in dealing with categorical data is treating nominal data as ordinal data. Take marital status as an example; if we create a 1-to-1 mapping between marital status and numerical values, the mapping may look like the following:

• Single -> 0
• Married -> 1
• Divorced -> 2
• Widowed -> 3

Although the marital status values do not imply any order or rank, a machine learning algorithm may assume that widowed is higher or larger (by some measurement) than divorced, and so on. Based on this incorrect assumption, machine learning may still produce useful output; however, the performance may not be optimal. One approach to handling this scenario is one-hot encoding. The basic idea is to create a dummy feature for each unique value of the original categorical feature. For example, marital status can be encoded as follows.

• Single -> (1, 0, 0, 0)
• Married -> (0, 1, 0, 0)
• Divorced -> (0, 0, 1, 0)
• Widowed -> (0, 0, 0, 1)

This way, the machine learning algorithm treats the feature as different labels instead of assuming the feature has rank or order.

The following example demonstrates how to encode categorical data.

```python
from sklearn import preprocessing
import numpy as np
import pandas as pd
import urllib.request

def one_hot_encoder(data=[]):
    '''Transfer categorical data to numerical data based on the one-hot encoding approach.

    Parameters
    ----------
    data: one-dimensional list

    Return
    ------
    The list of the encoded data based on the one-hot encoding approach
    '''
    # Since scikit-learn's OneHotEncoder only accepts numerical data, use LabelEncoder to transfer the
    # categorical data to numerical data by using a simple encoding approach.
    # For example, t -> 0; f -> 1
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
    label_encoder = preprocessing.LabelEncoder()
    numerical_data = label_encoder.fit_transform(data)
    two_d_array = [[item] for item in numerical_data]

    # Use scikit-learn's OneHotEncoder to encode the feature
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    encoder = preprocessing.OneHotEncoder()
    encoder.fit(two_d_array)
    return encoder.transform(two_d_array).toarray()

if __name__ == '__main__':
    URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data'
    urllib.request.urlretrieve(URL, 'crx.data')
    # Read the downloaded data into a DataFrame; the file has no header row
    crx_data = pd.read_csv('crx.data', header=None)

    # Use one-hot encoding to transfer the A9 attribute of the Japanese Credit Data Set
    # A9:   t, f.
    # The encoded output may look like
    # t -> [0, 1]
    # f -> [1, 0]
    A9 = crx_data.iloc[:, 8].values
    A9_encoded = one_hot_encoder(A9)
```

(The complete example can be found at one_hot_encode_example.py)

## Missing Features

The Japanese Credit Screening Data Set has missing features in some samples, which is not an uncommon issue in real-world machine learning problems. Fortunately, the fact that a data set has missing fields or attributes does not mean it is not useful. Several approaches can be used to fill the gaps.

### Remove them

As the title states, this approach discards entire rows or columns that contain missing values. Removal is the most straightforward option, but it should be considered only if the data set is large, for an apparent reason: the fewer samples there are to train on, the harder it is to avoid overfitting. However, sometimes dropping rows may be sufficient. For example, a feature with missing values in some samples may just be noise; in that case, removing the entire feature column may be useful. Unfortunately, this type of information is often unknown.

### Predict the missing values

The basic idea is to use existing data to predict the missing values. Obviously, this is not a trivial approach: it means we need to define a supervised learning sub-problem, solve it, and use the resulting sub-model to predict the missing data. Not only is it hard to evaluate the performance of the sub-model, but the accuracy of the sub-model's predictions also affects the learning performance of the overall problem.

### Fill up the missing values based on the known values

The first two options are either too difficult to achieve or not practical at all. One reasonable and rational approach is to impute the missing data based on the other known values. This strategy replaces missing values with something that can be obtained from the existing values, such as the mean, the median, or the most frequent value. The following example demonstrates how to impute missing values by frequency and by mean.

```python
from sklearn import preprocessing
import numpy as np
import pandas as pd
import urllib.request
from collections import Counter

def imputer_by_most_frequent(missing_values=np.nan, data=[]):
    '''Impute missing values by frequency, i.e., the value that appears most often.

    Parameters
    ----------
    missing_values:
        The missing value can be np.nan, '?', or whatever character indicates a missing value.

    data: one-dimensional list

    Return
    ------
    The list of data whose missing values are filled by the most frequent value
    '''
    # Find the value that appears most often by using Counter.
    most = Counter(data).most_common(1)[0][0]
    complete_list = []
    for item in data:
        if item is missing_values:
            item = most
        complete_list.append(item)
    return complete_list

if __name__ == '__main__':
    URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data'
    urllib.request.urlretrieve(URL, 'crx.data')
    # Read the downloaded data into a DataFrame; the file has no header row
    crx_data = pd.read_csv('crx.data', header=None)

    # Impute the missing values in A1 by the most frequent value
    # A1:   b, a
    #
    # Impute the missing values in A2 by the mean
    # A2:   continuous

    # Since the Japanese Credit Data Set uses '?' to denote missing values, replace it with np.nan.
    # scikit-learn's Imputer only accepts np.nan or integer; therefore, convert '?' to np.nan.
    # This transformation is for A2, which uses scikit-learn's Imputer.
    # For A1, which uses imputer_by_most_frequent(), this transformation is not necessary.
    crx_data.replace('?', np.nan, inplace=True)

    A1_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 0].values)

    # Use scikit-learn's Imputer to fill in missing values with the mean.
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
    imputer = preprocessing.Imputer(missing_values=np.nan, strategy='mean', axis=0)
    # Convert to a two-dimensional list, since Imputer only accepts two-dimensional lists.
    A2_two_d = np.array([[item] for item in crx_data.iloc[:, 1].values])
    A2_no_missing = imputer.fit_transform(A2_two_d)
```

(The complete example can be found at input_missing_example.py)

## Feature Scaling

Most machine learning algorithms accept any numerical data. However, most of them perform better if the features are on the same scale, since a machine learning algorithm does not naturally account for different scales across features. It is not difficult to see the importance of feature scaling. For example, assume we have a data set with two features: the first feature ranges from 1 to 10, and the second from 1 to 1000. When we apply the Perceptron Learning Algorithm to this data set, it is intuitive that the update function is heavily affected by the second feature; therefore, it may produce a biased output.

Typically, there are two approaches to bring features onto the same scale: normalization and standardization.

### Normalization

Normalization means adjusting values measured on different scales to a common scale. Most often, normalization refers to rescaling the features to the range [0, 1]; in fact, the features can be bounded to whatever interval we want. Normalization can be achieved by the following formula, min-max scaling:

$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$, where $x$ is a sample, $x_{min}$ is the smallest value in the feature column, and $x_{max}$ is the largest value.

Although normalization is useful when we need values in a bounded interval, it has a drawback: if a data set has outliers (which is normal), the normalization equation may be heavily affected by them, i.e., $x_{max}$ may be huge and $x_{min}$ tiny, so the normal data end up squeezed into a very small interval.
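A minimal sketch of the min-max formula makes the outlier drawback concrete (scikit-learn's MinMaxScaler, used in the example later in this article, performs the same per-column computation):

```python
import numpy as np

def min_max_scale(x):
    '''Rescale a feature column to the range [0, 1] using min-max scaling.'''
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Without an outlier, the values spread over the whole interval.
print(min_max_scale([1, 5, 10]))        # [0.         0.44444444 1.        ]

# A single outlier squeezes the normal data into a tiny sub-interval:
# here 1, 5, and 10 all map below 0.01.
print(min_max_scale([1, 5, 10, 1000]))
```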

### Standardization

In contrast to normalization, which bounds values to an interval, standardization centers the features at their mean with unit standard deviation, so that the feature columns have the properties of a standard normal distribution. The process of standardization can be expressed by the following formula:

$x_{std} = \frac{x - \mu}{\sigma}$ where $\mu$ is the sample mean of a particular feature column and $\sigma$ is the corresponding standard deviation.

This way, standardized data keeps useful information about outliers and makes a learning algorithm less sensitive to them. However, standardization has a drawback too: unlike normalization, it is not bounded.
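The standardization formula can be sketched in the same fashion (scikit-learn provides the equivalent StandardScaler):

```python
import numpy as np

def standardize(x):
    '''Center a feature column at mean 0 with standard deviation 1.'''
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

z = standardize([1, 5, 10, 1000])
# The result has mean 0 and standard deviation 1, but is not bounded:
# the outlier remains visible as a large z-score.
print(z)
```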

# Example

Now, we have enough tools to demonstrate the Pocket Learning Algorithm on the Japanese Credit Screening Data Set.

```python
# pandas is an open source library providing high-performance,
# easy-to-use data structures and data analysis tools. http://pandas.pydata.org/
import pandas as pd

# NumPy is the fundamental package for scientific computing with Python.
# http://www.numpy.org/
import numpy as np

# Seaborn is a Python visualization library based on matplotlib.
# http://seaborn.pydata.org/index.html#
import seaborn as sns

# matplotlib is a python 2D plotting library which produces publication quality
# figures in a variety of hardcopy formats and interactive environments across platforms.
# http://matplotlib.org/2.0.0/index.html
from matplotlib import pyplot

from pocket_classifier import PocketClassifier
from perceptron_classifier import PerceptronClassifier
from collections import Counter
from urllib import request
from sklearn import preprocessing
from sklearn import model_selection

# Set aesthetic parameters in one step.
sns.set()

def imputer_by_most_frequent(missing_values=np.nan, data=[]):
    '''Impute missing values by frequency, i.e., the value that appears most often.

    Parameters
    ----------
    missing_values:
        The missing value can be np.nan, '?', or whatever character indicates a missing value.

    data: one-dimensional list

    Return
    ------
    The list of data whose missing values are filled by the most frequent value
    '''
    # Find the value that appears most often by using Counter.
    most = Counter(data).most_common(1)[0][0]
    complete_list = []
    for item in data:
        if item is missing_values:
            item = most
        complete_list.append(item)
    return complete_list

def one_hot_encoder(data=[]):
    '''Transfer categorical data to numerical data based on the one-hot encoding approach.

    Parameters
    ----------
    data: one-dimensional list

    Return
    ------
    The list of the encoded data based on the one-hot encoding approach
    '''
    # Since scikit-learn's OneHotEncoder only accepts numerical data, use LabelEncoder to transfer the
    # categorical data to numerical data by using a simple encoding approach.
    # For example, t -> 0; f -> 1
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
    label_encoder = preprocessing.LabelEncoder()
    numerical_data = label_encoder.fit_transform(data)
    two_d_array = [[item] for item in numerical_data]

    # Use scikit-learn's OneHotEncoder to encode the feature
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    encoder = preprocessing.OneHotEncoder()
    encoder.fit(two_d_array)
    return encoder.transform(two_d_array).toarray()

if __name__ == '__main__':
    URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data'
    request.urlretrieve(URL, 'crx.data')
    # Read the downloaded data into a DataFrame; the file has no header row
    crx_data = pd.read_csv('crx.data', header=None)
    crx_data.replace('?', np.nan, inplace=True)

    # Transform the categorical data to numerical data and impute the missing data:
    # A1:   b, a. (missing) frequency
    # A2:   continuous. (missing) mean
    # A3:   continuous.
    # A4:   u, y, l, t. (missing) frequency
    # A5:   g, p, gg.  (missing) frequency
    # A6:   c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff. (missing) frequency
    # A7:   v, h, bb, j, n, z, dd, ff, o. (missing) frequency
    # A8:   continuous.
    # A9:   t, f.
    # A10:  t, f.
    # A11:  continuous.
    # A12:  t, f.
    # A13:  g, p, s.
    # A14:  continuous. (missing) mean
    # A15:  continuous.
    # A16:  +, -        (class label)

    A1_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 0].values)
    A1_encoded = one_hot_encoder(A1_no_missing)

    imputer = preprocessing.Imputer(missing_values=np.nan, strategy='mean', axis=0)
    A2_two_d = np.array([[item] for item in crx_data.iloc[:, 1].values])
    A2_no_missing = imputer.fit_transform(A2_two_d)

    A3 = crx_data.iloc[:, 2].values

    A4_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 3].values)
    A4_encoded = one_hot_encoder(A4_no_missing)

    A5_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 4].values)
    A5_encoded = one_hot_encoder(A5_no_missing)

    A6_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 5].values)
    A6_encoded = one_hot_encoder(A6_no_missing)

    A7_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 6].values)
    A7_encoded = one_hot_encoder(A7_no_missing)

    A8 = crx_data.iloc[:, 7].values

    A9_encoded = one_hot_encoder(crx_data.iloc[:, 8].values)

    A10_encoded = one_hot_encoder(crx_data.iloc[:, 9].values)

    A11 = crx_data.iloc[:, 10].values

    A12_encoded = one_hot_encoder(crx_data.iloc[:, 11].values)

    A13_encoded = one_hot_encoder(crx_data.iloc[:, 12].values)

    A14_two_d = np.array([[item] for item in crx_data.iloc[:, 13].values])
    A14_no_missing = imputer.fit_transform(A14_two_d)

    A15 = crx_data.iloc[:, 14].values

    # Aggregate all the encoded data together into a two-dimensional set
    data = list()
    label = list()
    for index in range(690):
        temp = np.append(A1_encoded[index], A2_no_missing[index])
        temp = np.append(temp, A3[index])
        temp = np.append(temp, A4_encoded[index])
        temp = np.append(temp, A5_encoded[index])
        temp = np.append(temp, A6_encoded[index])
        temp = np.append(temp, A7_encoded[index])
        temp = np.append(temp, A8[index])
        temp = np.append(temp, A9_encoded[index])
        temp = np.append(temp, A10_encoded[index])
        temp = np.append(temp, A11[index])
        temp = np.append(temp, A12_encoded[index])
        temp = np.append(temp, A13_encoded[index])
        temp = np.append(temp, A14_no_missing[index])
        temp = np.append(temp, A15[index])
        data.append(temp.tolist())
        label.append(crx_data[15][index])

    # Use scikit-learn's MinMaxScaler to scale the training data set.
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
    min_max_scaler = preprocessing.MinMaxScaler()
    data_minmax = min_max_scaler.fit_transform(data)

    features = len(data[0])

    # Use scikit-learn's train_test_split function to separate the Japanese Credit
    # Screening Data Set into a training subset (75% of the data) and a test subset (25% of the data)
    DATA_TRAIN, DATA_TEST, LABELS_TRAIN, LABELS_TEST = model_selection.train_test_split(data_minmax, label, test_size=0.25, random_state=1000)

    pocket_classifier = PocketClassifier(features, ('+', '-'))
    pocket_classifier.train(DATA_TRAIN, LABELS_TRAIN, 100)

    result = pocket_classifier.classify(DATA_TEST)

    # Count only the predictions that disagree with the true labels
    misclassify = 0
    for predict, answer in zip(result, LABELS_TEST):
        if predict != answer:
            misclassify += 1
    print("Accuracy rate: %2.2f" % (100 * (len(result) - misclassify) / len(result)) + '%')
```

(Source code can be found at pocket_learning_algorithm_example.py)

Note that the purpose of this example is to demonstrate the Pocket Learning Algorithm and the idea of feature engineering. Therefore, the example may not be optimal, regarding either the feature engineering or the Python programming, and the result (about 85% accuracy) can be improved as well.

# Machine Learning Basics and Perceptron Learning Algorithm

[Updated: May 02, 2018]

Machine learning is a term that people in the software industry talk about often, and it is becoming even more popular day by day. The media are filled with fancy machine-learning-related words: deep learning, OpenCV, TensorFlow, and more. Even people outside the software industry are trying to leverage the power of machine learning. It is not surprising that its application is becoming more widespread in every business.

However, it is easy to neglect or simply forget the fundamentals of machine learning when our minds are filled with so many amazing machine learning ideas and terminologies. This article aims not only to review the fundamentals of machine learning, but also to give a brief introduction for people learning machine learning for the first time, so they know what machine learning is, how it works, how to do it well, and that machine learning is not magic.

Hopefully, experienced readers can benefit from reviewing the fundamental concepts of machine learning, and newcomers can get the basic idea of machine learning from this article.

# What is Machine Learning? What can it do?

Machine learning is a subfield of Artificial Intelligence. As its name implies, it tries to help machines "learn" something from data. Recall how we learned to recognize an animal, for instance a dog, when we were little kids: we probably started by reading a book containing dog pictures, i.e., the data, and afterwards we knew it was a dog when we encountered one. That is the essence of machine learning: learning from data.

It is always easy to explain a concept with an example. The Kaggle competition Predict Grant Applications, in which I participated several years ago, rewarded participants who could provide a model to help simplify and accelerate the review process and reduce its cost.

The traditional way to determine whether an applicant qualifies for a grant is for academic reviewers to spend their time reviewing the applicants' documents and making decisions. This process takes a considerable amount of time, especially since the school states that the majority of applications end up being rejected, so the valuable academic reviewers' time is wasted.

Here is where machine learning can help: a machine learning approach can preliminarily filter out the applications it predicts to have a high chance of success. Therefore, the review committee can spend their time on these candidates.

Therefore, the main goal of machine learning is to predict the future or to make decisions by training mathematical models with context-related data.

## How does it work?

Machine learning starts with data, so before answering how it works, let's first look at the data. The following table depicts an excerpt of the Predict Grant Applications Data Set and demonstrates the common notations used in machine learning, such as samples, features, and labels.

So, how does machine learning work? First of all, everybody would agree that, in the Predict Grant Applications example, some attributes carry heavier weight than others. During their reviews, the academic reviewers may have a sense that some attributes matter more than others; for instance, the number of journal articles is probably more important than the application's Person ID. This implies that a pattern of a successful application exists. Second, even if we know the pattern exists, it is impossible (or very hard) to write down a math formula that decides who would be granted. Otherwise, we would just write a program using that formula and solve this problem at once. Third, the school has historical records of this program, i.e., they have data. These observations capture the nature of problems that suit machine learning, namely problems with the following characteristics:

1. A pattern exists. (In this Predict Grant Applications example, the reviewers have a feeling that some attributes are more important than others.)
2. There is no easy way to solve the problem with a math equation. (This is probably the main reason we need a machine learning approach; a closed-form math solution is always better than a machine learning solution, if we can find one.)
3. We have data. (In this example, the University of Melbourne has historical records of applications; without data, there is nothing machine learning can do.)

Once a problem meets these three points, machine learning is ready to solve this problem. A basic machine learning approach has the following components.

• Target function: an unknown ideal function. It exists, but is unknown.
• Hypothesis set: the set of all candidate functions, e.g., Perceptron, Neural Network, Support Vector Machine, and so on.
• Learning algorithm: picks the optimal function from the hypothesis set based on the data, e.g., Perceptron Learning Algorithm, backpropagation, quadratic programming, and so forth.
• Learning model: normally, the combination of a hypothesis set and a learning algorithm is referred to as a learning model.

There are many tools, i.e., learning models, to work on machine learning problems. Therefore, the first step is to pick a learning model to start with. Each learning model has its strengths and weaknesses; some may be good at a certain type of problem, others may not. How to choose an appropriate learning model is beyond the scope of this article, but, at least, we can always pick one as a starting point.

As soon as a learning model is picked, we can start training the model by feeding it data. Take the Predict Grant Application example again: this process starts with random factors, i.e., the weights of each attribute. By feeding in the historical records, it then tunes these factors to make them more and more aligned with how the academic reviewers have reviewed applications before, until the model is ultimately able to predict, in general, how reviewers determine whether applicants are granted. This process is called training the model.

After the model is trained, it can make predictions when fed new data. In the Predict Grant Application example, the new data could be the current year's applications. The university can use the model to filter out the applications that are unlikely to succeed, so the review committee can focus on the applications that have a high chance of success.

A benefit of the machine learning approach is that it generalizes to much larger data sets in many more dimensions. The problem is trivial in a low number of dimensions, i.e., attributes. However, the reality is that the data set the University of Melbourne provided has 249 features, which makes this problem much harder (and almost impossible to pin down with a simple math formula). Fortunately, the power of machine learning is that it can be straightforwardly applied and evaluated in the case of data with many more features.

The next section uses probably the simplest learning model to demonstrate the basic workflow of the machine learning approach.

# A Simple Example: Perceptron Learning Algorithm

This example uses a classic data set, the Iris Data Set, which contains three classes of 50 instances each, where each class refers to a type of iris plant. The goal of this example is to use a machine learning approach to build a program that classifies the type of iris flowers.

## Problem Setup

The Iris Data Set contains three classes (classes are often also referred to as labels): Iris Setosa, Iris Versicolour, and Iris Virginica.

Besides its class, each instance has the following attributes:

• Sepal length in cm
• Sepal width in cm
• Petal length in cm
• Petal width in cm

Each instance, a.k.a. sample, of the Iris Data Set looks like (5.1, 3.5, 1.4, 0.2, Iris-setosa).

The learning model this example chooses is the Perceptron together with the Perceptron Learning Algorithm.

## Perceptron Learning Algorithm

The Perceptron Learning Algorithm trains the simplest form of artificial neural network: the single-layer perceptron. A perceptron is an artificial neuron conceived as a model of biological neurons, which are the elementary units in an artificial neural network. An artificial neuron computes a linear combination of certain (one or more) inputs and a corresponding weight vector. That is to say, a perceptron has the following definitions:

Given a data set $D$, which contains training samples $X$ and output labels $Y$, and can be formed as a matrix:

Each $x_i$ is one sample of $X$: $x_i = \left\{ x_{i1}, x_{i2}, ..., x_{in} \right\}, \forall i = 1, 2, ..., m$
Each $y_i$ is the actual label and has a binary value: $\left\{ -1, 1 \right\}$.
$W$ is the weight vector. Each attribute $x_{ij}$ has a corresponding weight $w_j$.
Since a perceptron is a linear combination of $X$ and $W$, it can be denoted as $\begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & x_{ij} & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}$
For each sample $x_i$, the output is $y_i = \phi \left( \displaystyle\sum_{j=1}^{n}x_{ij}w_j \right)$, where $\phi$ is the transfer function:
$\phi = \begin{cases} 1 & \quad \text{if } \sum_{j=1}^{n}x_{ij}w_j>\theta \text{, where } \theta \text{ is the predefined threshold} \\ -1 & \quad \text{otherwise} \end{cases}$
For the sake of simplicity, $\theta$ can be folded into the sum as the bias term $x_{i0}w_0$, where $x_{i0} = 1$ and $w_0 = -\theta$
Therefore, the linear combination of $X$ and $W$ can be rewritten as $\begin{bmatrix} x_{10} & \cdots & x_{1n} \\ \vdots & x_{ij} & \vdots \\ x_{m0} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} w_0 \\ \vdots \\ w_n \end{bmatrix}$
The output for the $i$th sample is then $y_i = \phi \left( \displaystyle\sum_{j=0}^{n}x_{ij}w_j \right)$, where $\phi$ is the transfer function:
$\phi = \begin{cases} 1 & \quad \text{if } \sum_{j=0}^{n}x_{ij}w_j>0 \\ -1 & \quad \text{otherwise} \end{cases}$
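The perceptron defined above can be sketched in a few lines of Python (a minimal illustration, not part of the article's library; the names `transfer`, `x`, and `w` are my own):

```python
import numpy as np

def transfer(x, w):
    """Transfer function phi: returns 1 if the linear combination of
    the inputs and weights is positive, -1 otherwise.
    x includes the bias input x_0 = 1; w includes w_0 = -theta."""
    return 1 if np.inner(x, w) > 0 else -1

# One sample with the bias input x_0 = 1 prepended
x = np.array([1.0, 5.1, 3.5])
w = np.array([-0.5, 0.2, -0.1])  # w_0 plays the role of -theta

print(transfer(x, w))  # 1, since -0.5 + 5.1*0.2 + 3.5*(-0.1) = 0.17 > 0
```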

## The Learning Steps

Given the definition of the Perceptron, the Perceptron Learning Algorithm works by the following steps:

1. Initialize the weights to 0 or small random numbers.
2. For each training sample $x_i$, perform the following sub-steps:
    1. Compute the output of the transfer function $\phi$ to get the predicted class label $p_i$.
    2. Update the weights $W$.
    3. Record the number of misclassifications.
3. If the number of misclassifications is not 0 after training on the whole set $X$, repeat step 2 from the beginning of the training set, i.e., from $x_1$. Repeat until the number of misclassifications is 0.

Note:

The output of step 2.1 is the class predicted by the $\phi$ function.

In step 2.2, the update of each weight $w_j$ of $W$ follows the rule:

$w_j \left( t + 1\right) = w_j \left( t \right) + \Delta w_j = w_j \left( t \right) + \left( y_i - p_i \right) x_{ij}$, where $t$ indicates the step: $t$ means the current step, and $t+1$ means the next step. Therefore, $w_j\left( t \right)$ is the current weight and $w_j\left( t+1 \right)$ is the weight after the update.

If the sample is classified correctly, $\Delta w_j = \left( -1 - \left( -1 \right) \right) x_{ij} = 0$ or $\Delta w_j = \left( 1 - 1 \right) x_{ij} = 0$. In this case, $w_j \left( t + 1 \right) = w_j \left( t \right)$: no update.

If the sample is misclassified, $\Delta w_j = \left( -1 - 1 \right) x_{ij} = -2 x_{ij}$ or $\Delta w_j = \left( 1 - \left( -1 \right) \right) x_{ij} = 2 x_{ij}$. In this case, $w_j \left( t + 1 \right) = w_j \left( t \right) + 2 x_{ij}$ or $w_j \left( t + 1 \right) = w_j \left( t \right) - 2 x_{ij}$: the weight updates.
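The two cases can be checked numerically (a small sketch; `delta_w`, taking the actual label first and the predicted label second, is my own name):

```python
import numpy as np

def delta_w(actual, predicted, x):
    """Weight change for one sample: (actual - predicted) * x."""
    return (actual - predicted) * np.asarray(x)

x = np.array([1.0, 3.0, 0.2])

print(delta_w(1, 1, x))    # zero vector: correct classification, no update
print(delta_w(1, -1, x))   # +2 * x: misclassified, weights move toward the sample
print(delta_w(-1, 1, x))   # -2 * x: misclassified, weights move away from the sample
```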

In step 2.3, the convergence of the PLA is only guaranteed if the two classes are linearly separable. If they are not, the PLA never stops. One simple modification is the Pocket Learning Algorithm, which is discussed in Machine Learning Basics: Pocket Learning Algorithm and Basic Feature Engineering.

The Perceptron Learning Algorithm can be implemented as follows:

```python
import numpy as np

class PerceptronClassifier:
    '''Perceptron binary classifier that uses the Perceptron Learning
    Algorithm to do classification with two classes.

    Parameters
    ----------
    number_of_attributes : int
        The number of attributes of the data set.
    class_labels : tuple
        The class labels. They can be anything as long as there are
        only two types of labels.

    Attributes
    ----------
    weights : list of float
        The list of weights corresponding to the input attributes.
    misclassify_record : list of int
        The number of misclassifications for each training iteration.
    '''
    def __init__(self, number_of_attributes: int, class_labels: tuple):
        # Initialize the weights to zero.
        # The size is the number of attributes plus the bias, i.e., x_0 * w_0
        self.weights = np.zeros(number_of_attributes + 1)

        # Record the number of misclassifications for each training iteration
        self.misclassify_record = []

        # Build the maps between the original labels and the numerical labels
        # For example, ('a', 'b') -> {1: 'a', -1: 'b'}
        self._label_map = {1: class_labels[0], -1: class_labels[1]}
        self._reversed_label_map = {class_labels[0]: 1, class_labels[1]: -1}

    def _linear_combination(self, sample):
        '''Linear combination of the sample and the weights.'''
        return np.inner(sample, self.weights[1:])

    def train(self, samples, labels, max_iterator=10):
        '''Train the model.

        Parameters
        ----------
        samples : two-dimensional list
            Training data set
        labels : list of labels
            The class labels of the training data
        max_iterator : int
            The maximum number of iterations, in case the
            training data is not linearly separable.
        '''
        # Transfer the labels to numerical labels
        transferred_labels = [self._reversed_label_map[index] for index in labels]

        for _ in range(max_iterator):
            misclassifies = 0
            for sample, target in zip(samples, transferred_labels):
                linear_combination = self._linear_combination(sample)
                update = target - np.where(linear_combination >= 0.0, 1, -1)

                # Use numpy.multiply to multiply element-wise
                self.weights[1:] += np.multiply(update, sample)
                self.weights[0] += update

                # Record the number of misclassifications
                misclassifies += int(update != 0.0)

            if misclassifies == 0:
                break
            self.misclassify_record.append(misclassifies)

    def classify(self, new_data):
        '''Classify the samples based on the trained weights.

        Parameters
        ----------
        new_data : two-dimensional list
            New data to be classified

        Return
        ------
        list
            The list of predicted class labels.
        '''
        predicted_result = np.where((self._linear_combination(new_data) + self.weights[0]) >= 0.0, 1, -1)
        return [self._label_map[item] for item in predicted_result]
```

(Source code can be found here)

## Apply Perceptron Learning Algorithm onto Iris Data Set

Normally, the first step in applying a machine learning algorithm to a data set is to transform the data into a format the algorithm can recognize. This process may involve normalization, dimension reduction, and feature engineering. For example, most machine learning algorithms only accept numerical data, so a data set needs to be transformed into a numerical format.

In the Iris Data Set, the attributes sepal length, sepal width, petal length, and petal width are numerical values, but the class labels are not. Therefore, in the implementation of PerceptronClassifier, the train function transfers the class labels to a numerical format. A simple way is to use numbers to indicate these labels: 0 indicates Setosa, 1 implies Versicolour, and 2 means Virginica. Then, the Iris Data Set can be viewed in the form below to feed into the Perceptron Learning Algorithm.

$X = \left\{ x_0, x_1, x_2, x_3, x_4 \right\} = \left\{ 1, sepal-length, sepal-width, petal-length, petal-width \right\}$

$Y = \left\{ 0, 1, 2 \right\}$

Perceptron is a binary classifier, but the Iris Data Set has three labels. There are two common ways to deal with multiclass problems: one-vs-all and one-vs-one. This section uses a simplified one-vs-one strategy to determine the type of iris plant.

The one-vs-one approach trains a model for each pair of classes and determines the correct class by a majority vote. The Iris Data Set has three classes, so we need all pairs of the three classes: $\left\{ \left\{ setosa, versicolour \right\}, \left\{ setosa, virginica \right\}, \left\{ versicolour, virginica \right\} \right\}$.
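The pairs can also be enumerated programmatically, which scales to any number of classes (a small sketch using the standard library):

```python
from itertools import combinations

classes = ['setosa', 'versicolour', 'virginica']

# One-vs-one needs one binary classifier per unordered pair of classes
pairs = list(combinations(classes, 2))
print(pairs)
# [('setosa', 'versicolour'), ('setosa', 'virginica'), ('versicolour', 'virginica')]
```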

Besides, machine learning is not restricted to using all the features of the data set. Instead, only the important features are necessary. Here we consider only two features: sepal width and petal width. In fact, choosing the right features is so important that there is a whole subject, Feature Engineering, dealing with this problem.

```python
import numpy as np
import pandas as pd
import urllib.request
from perceptronlib.perceptron_classifier import PerceptronClassifier

# Download the Iris Data Set and load it into a pandas DataFrame
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
urllib.request.urlretrieve(url, 'iris.data')
IRIS_DATA = pd.read_csv('iris.data', header=None)

# Prepare the training data and test data.
# The original Iris Data Set has 150 samples: 50 samples for each class.
# This example takes the first 40 samples of each class as training data
# and the other 10 samples of each class as testing data,
# then uses the testing data to verify the trained Perceptron learning model.
# 0 ~ 39: setosa training set
# 40 ~ 49: setosa testing set
# 50 ~ 89: versicolor training set
# 90 ~ 99: versicolor testing set
# 100 ~ 139: virginica training set
# 140 ~ 149: virginica testing set
# Use pandas iloc to select samples by position and return a one-dimensional array
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc
SETOSA_LABEL = IRIS_DATA.iloc[0:40, 4].values
VERSICOLOR_LABEL = IRIS_DATA.iloc[50:90, 4].values
VIRGINICA_LABEL = IRIS_DATA.iloc[100:140, 4].values

SETOSA_VERSICOLOR_TRAINING_LABEL = np.append(SETOSA_LABEL, VERSICOLOR_LABEL)
SETOSA_VIRGINICA_TRAINING_LABEL = np.append(SETOSA_LABEL, VIRGINICA_LABEL)
VERSICOLOR_VIRGINICA_TRAINING_LABEL = np.append(VERSICOLOR_LABEL, VIRGINICA_LABEL)

# This example uses only sepal width and petal width to train.
SETOSA_DATA = IRIS_DATA.iloc[0:40, [1, 3]].values
VERSICOLOR_DATA = IRIS_DATA.iloc[50:90, [1, 3]].values
VIRGINICA_DATA = IRIS_DATA.iloc[100:140, [1, 3]].values

# Use the one-vs-one strategy to train the three-class data set, so we need three
# binary classifiers: setosa-versicolor, setosa-virginica, and versicolor-virginica
SETOSA_VERSICOLOR_TRAINING_DATA = np.append(SETOSA_DATA, VERSICOLOR_DATA, axis=0)
SETOSA_VIRGINICA_TRAINING_DATA = np.append(SETOSA_DATA, VIRGINICA_DATA, axis=0)
VERSICOLOR_VIRGINICA_TRAINING_DATA = np.append(VERSICOLOR_DATA, VIRGINICA_DATA, axis=0)

# Prepare the test data set. Use only sepal width and petal width as well.
SETOSA_TEST = IRIS_DATA.iloc[40:50, [1, 3]].values
VERSICOLOR_TEST = IRIS_DATA.iloc[90:100, [1, 3]].values
VIRGINICA_TEST = IRIS_DATA.iloc[140:150, [1, 3]].values
TEST = np.append(SETOSA_TEST, VERSICOLOR_TEST, axis=0)
TEST = np.append(TEST, VIRGINICA_TEST, axis=0)

# Prepare the targets of the test data to verify the prediction
SETOSA_VERIFY = IRIS_DATA.iloc[40:50, 4].values
VERSICOLOR_VERIFY = IRIS_DATA.iloc[90:100, 4].values
VIRGINICA_VERIFY = IRIS_DATA.iloc[140:150, 4].values
VERIFY = np.append(SETOSA_VERIFY, VERSICOLOR_VERIFY)
VERIFY = np.append(VERIFY, VIRGINICA_VERIFY)

# Define a setosa-versicolor Perceptron with 2 attributes
perceptron_setosa_versicolor = PerceptronClassifier(2, ('Iris-setosa', 'Iris-versicolor'))
# Train the model
perceptron_setosa_versicolor.train(SETOSA_VERSICOLOR_TRAINING_DATA, SETOSA_VERSICOLOR_TRAINING_LABEL)

# Define a setosa-virginica Perceptron with 2 attributes
perceptron_setosa_virginica = PerceptronClassifier(2, ('Iris-setosa', 'Iris-virginica'))
# Train the model
perceptron_setosa_virginica.train(SETOSA_VIRGINICA_TRAINING_DATA, SETOSA_VIRGINICA_TRAINING_LABEL)

# Define a versicolor-virginica Perceptron with 2 attributes
perceptron_versicolor_virginica = PerceptronClassifier(2, ('Iris-versicolor', 'Iris-virginica'))
# Train the model
perceptron_versicolor_virginica.train(VERSICOLOR_VIRGINICA_TRAINING_DATA, VERSICOLOR_VIRGINICA_TRAINING_LABEL)

# Run the three binary classifiers
predict_target_1 = perceptron_setosa_versicolor.classify(TEST)
predict_target_2 = perceptron_setosa_virginica.classify(TEST)
predict_target_3 = perceptron_versicolor_virginica.classify(TEST)

overall_predict_result = []
for item in zip(predict_target_1, predict_target_2, predict_target_3):
    unique, counts = np.unique(item, return_counts=True)
    temp_result = zip(unique, counts)
    # Sort by votes and return the class that has the majority
    overall_predict_result.append(sorted(temp_result, reverse=True, key=lambda tup: tup[1])[0][0])
# The vote counts per test sample should look like:
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-setosa', 2), ('Iris-versicolor', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-versicolor', 2), ('Iris-virginica', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]
# [('Iris-virginica', 2), ('Iris-versicolor', 1)]

# Verify the results
misclassified = 0
for predict, verify in zip(overall_predict_result, VERIFY):
    if predict != verify:
        misclassified += 1
print("The number of misclassified samples: " + str(misclassified))
```

(Source code can be found here)

The complete source code of the Perceptron Learning Algorithm is at https://github.com/shunsvinyard/perceptronlib

# Is Learning Feasible?

Many people may see machine learning as magic. It is not. In fact, machine learning is all about math, especially probability and statistics.

## Learning vs. Memorizing

To show that learning is feasible, we first need to define what learning is. In the section A Simple Example, the Perceptron Learning Algorithm eventually succeeds in classifying all the iris flowers, but is it really learning? Or is it just memorizing? To answer these questions, we define the in-sample error and the out-of-sample error. The in-sample error is the error rate within the sample. For example, in the end, the Perceptron learning model can classify all the iris flowers in the Iris Data Set; its in-sample error is 0, i.e., no error at all. In contrast, the out-of-sample error is the error rate outside of the sample. In other words, it indicates the number of errors when the learning model sees new data. In the same example, if we feed new data to the Perceptron learning model, the rate of misclassification is its out-of-sample error. Therefore, to say a learning model is really learning, not only must the in-sample error be small, but the out-of-sample error must be small as well.

## So, is Learning Feasible?

The simple answer is yes: learning is feasible in the mathematical sense.

In probability and statistics, there are two important theorems which briefly show that learning is feasible: the Central Limit Theorem and the Law of Large Numbers. The Central Limit Theorem states that the distribution of the average of a large number of independent and identically distributed variables will be approximately normal, regardless of the underlying distribution. The Law of Large Numbers states that as more samples are collected, the average of the results gets close to the expected value, and tends to become closer as more trials are performed. These theorems guarantee that a sample proportion, or a difference in sample proportions, will follow something that resembles a normal distribution as long as (1) the observations in the sample are independent, and (2) the sample is large enough. In an oversimplified sense, the learning results we see in-sample should also be seen on out-of-sample data. Therefore, learning is feasible.
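The Law of Large Numbers is easy to see in a simulation: the sample mean of fair coin flips approaches the expected value 0.5 as the number of flips grows (a quick sketch with a fixed random seed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Flip a fair coin (0 or 1) n times and watch the sample mean approach 0.5
for n in (10, 1_000, 100_000):
    flips = rng.integers(0, 2, size=n)
    print(n, flips.mean())  # the gap from 0.5 shrinks roughly like 1/sqrt(n)
```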

Of course, this statement is oversimplified. The details of the theories that support the feasibility of machine learning are beyond the scope of this article. There are many resources that address this topic; one I recommend is the book Learning from Data, especially Chapter 2 on the VC dimension and generalization.

In short, to make learning feasible, the learning model has to achieve:

• In-sample error is small
• Out-of-sample error is close to the in-sample error

# How do we know if we learn well?

The conclusion of the previous section is that if the in-sample error is small and the out-of-sample error is close to the in-sample error, then the learning model has learned something. But what does that mean? How small should the in-sample error be, and how close should the out-of-sample error be? Of course, if both the in-sample error and the out-of-sample error are zero, the learning model learns perfectly. Unfortunately, most real-world problems are not like that. In fact, the machine learning approach is not about finding the ideal target function; instead, it looks for a hypothesis function $f$ which approximates the target function $g$. (The target function is always unknown; otherwise, we would not bother with a machine learning approach.) In other words, the machine learning approach looks for a good enough solution $f$ which is close enough to $g$. What does good enough mean? To quantify how close $f$ is to $g$, we need a way to define the distance between them. Usually it is called the error measure or error function (the error is also referred to as cost or risk).

## Error Measure

In general, we define a non-negative error measure that takes two arguments, the expected and the predicted output, and computes a total error value over the whole data set.

$Error = E \left( h, g \right)$, where $h$ is a hypothesis and $g$ is the target function. $E \left( h, g \right)$ is based on the errors on the individual inputs $x_i$, so we can define a pointwise error measure $e \left( h \left( x_i \right), g \left( x_i \right) \right)$. The overall error is then the average of this pointwise error: $Error_{overall} = \frac{1}{n} \sum_{i=1}^{n} \left| h \left( x_i \right) - g \left( x_i \right) \right|$
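This overall error can be computed directly; with $\pm 1$ labels, the pointwise error $\left| h(x_i) - g(x_i) \right|$ is 0 for a correct prediction and 2 for a wrong one (a minimal sketch, with names of my own choosing):

```python
import numpy as np

def overall_error(predicted, actual):
    """Average pointwise error: mean of |h(x_i) - g(x_i)|."""
    return np.mean(np.abs(np.asarray(predicted) - np.asarray(actual)))

predicted = [1, -1, 1, 1]
actual = [1, -1, -1, 1]  # the third sample is misclassified

print(overall_error(predicted, actual))  # 0.5, i.e., 2 * (1 misclassified out of 4)
```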

The choice of an error measure affects the choice of the learning model. Even if the data set and the target function are the same, the appropriate error measure may vary between problems. Consider an email spam classifier. A learning model for the spam problem can make two kinds of error: a false positive, when the model classifies a legitimate email as spam, and a false negative, when an email is spam but the model determines it is not. Imagine two scenarios: a personal email account and a work email account.

In the personal email account case, if the learning model fails to filter out a spam email, it is probably fine; just annoying. On the other hand, the learning model may filter out an email which is not spam, for example, an email from a friend or from a credit card company. In this case, it is probably fine too. We can use the table below to measure this case.

| | An email is spam | An email is not spam |
| --- | --- | --- |
| Classify an email as spam | 0 | 1 |
| Classify an email as not spam | 1 | 0 |

If the learning model classifies an email correctly, the cost of the error is 0. Otherwise, the cost is 1.

The other case is the work email account. As with the personal account, if the learning model fails to filter out spam, it is annoying but fine. However, if the learning model classifies emails from our boss as spam and leads us to miss them, then it is probably not fine. Therefore, in this case, classifying a legitimate email as spam carries a heavier weight than in the previous example.

| | An email is spam | An email is not spam |
| --- | --- | --- |
| Classify an email as spam | 0 | 10 |
| Classify an email as not spam | 1 | 0 |

Again, the error function (or cost function, or risk function) really depends on the problem and should be defined by the customer.
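A problem-specific cost like the tables above can be encoded as a small lookup table and summed over the predictions (a sketch; the labels and cost values mirror the work-account table):

```python
# Cost of each (predicted, actual) pair for the work email account:
# classifying a legitimate email as spam costs 10, missing a real
# spam costs 1, and correct classifications cost 0.
COST = {
    ('spam', 'spam'): 0,
    ('spam', 'not spam'): 10,
    ('not spam', 'spam'): 1,
    ('not spam', 'not spam'): 0,
}

def total_cost(predicted, actual):
    """Sum the table cost over all (prediction, truth) pairs."""
    return sum(COST[(p, a)] for p, a in zip(predicted, actual))

predicted = ['spam', 'not spam', 'spam']
actual = ['spam', 'spam', 'not spam']
print(total_cost(predicted, actual))  # 0 + 1 + 10 = 11
```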

## Overfitting and Underfitting

The section Is Learning Feasible concludes that to make learning feasible, the learning model has to keep the in-sample error small and the out-of-sample error close to the in-sample error. Normally, a training set is a representation of the global distribution, but it cannot contain all possible elements. Therefore, when training a model, it is necessary to fit the training data, i.e., keep the in-sample error small, while also keeping the model able to generalize when unseen input is presented, i.e., keep the out-of-sample error small. Unfortunately, this ideal condition is not easy to reach, and it is important to watch out for two phenomena: overfitting and underfitting.

The figure shows a normal classification on two-class data.

Underfitting means that the learning model is not able to capture the pattern shown by the training data set. The picture below demonstrates the case of underfitting.

Usually, underfitting is easy to observe, because the learning model does not fit the in-sample data well when it happens. In other words, when underfitting happens, the in-sample error is high.

Overfitting is the opposite of underfitting. It means that the learning model fits the training data very well, but so well that it fails to fit out-of-sample data. That is to say, an overfitting learning model loses its ability to generalize, so when unknown input is presented, the corresponding prediction error can be very high.

Unlike underfitting, overfitting is difficult to notice or prevent, because the in-sample error of an overfitting learning model is usually very low. We may think we trained the model very well until new data comes and the out-of-sample error is high. Only then do we realize that the model is overfitting.

Overfitting usually happens when the learning model is more complex than necessary to represent the target function. In short, the overfitting issue can be summarized as follows:

• As the number of data points increases, the chance of overfitting decreases.
• If the data set has much noise, the chance of overfitting is high.
• The more complex the model is, the easier overfitting happens.
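The complexity point can be demonstrated with polynomial regression: on the same 10 noisy samples of a linear trend, a degree-5 polynomial always achieves an in-sample error at least as low as a straight line, precisely because the extra flexibility lets it chase the noise (a sketch with a fixed seed; the lower in-sample error says nothing good about out-of-sample behavior):

```python
import numpy as np

rng = np.random.default_rng(42)

# 10 noisy samples of an underlying linear trend y = 2x + noise
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(0, 0.1, size=10)

errors = {}
for degree in (1, 5):
    coefficients = np.polyfit(x, y, degree)   # least-squares polynomial fit
    residuals = np.polyval(coefficients, x) - y
    errors[degree] = np.mean(residuals ** 2)  # in-sample mean squared error
    print(degree, errors[degree])

# The degree-5 model fits the training points more closely than the line,
# even though the true trend is linear: the hallmark of overfitting.
```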

# Types of Learning

Based on the type of data and the problems we want to solve, we can roughly categorize learning problems into three groups: supervised learning, unsupervised learning, and reinforcement learning.

## Supervised Learning

Supervised learning is probably the most well-studied learning problem. Essentially, in supervised learning, each sample contains the input data and an explicit, correct output. In other words, supervised learning can be formed as (input, correct output).

The Iris Data Set is a typical supervised learning problem. Each sample has four attributes as input (sepal length, sepal width, petal length, and petal width) and a correct output: the type of iris plant. We can visualize supervised learning by plotting the Iris Data Set. For the sake of simplicity, we plot a 2-D picture by taking sepal length and petal length as input and using different colors and symbols to indicate each class.

The main goal of supervised learning is to learn a model from data with correct labels in order to make predictions on unseen or future data. Common supervised learning problems include classification, for predicting class labels, and regression, for predicting continuous outcomes; examples are spam detection, pattern detection, and natural language processing.

## Unsupervised Learning

In contrast to supervised learning, the data set of unsupervised learning does not contain the correct output, and can be formed as (input, ?).

If we do not plot the types in different colors, the Iris Data Set becomes an example of unsupervised learning.

In many cases, a data set only has input and does not contain the correct output, like the picture above. How, then, can we learn anything from unsupervised learning problems? Fortunately, although there is no correct output, it is possible to learn something from the input alone. For example, in the picture above, each ellipse represents a cluster, and all the points inside its area can be labeled the same way. (Of course, we know the Iris Data Set has three classes; here we just pretend we did not know that.) Using unsupervised learning techniques, we can explore the structure of a data set and extract meaningful information without any prior knowledge of the groups. Because unsupervised learning problems do not come with labels, the field deals with problems like discovering hidden structures, finding subgroups with clustering, and dimensionality reduction for data compression.

## Reinforcement Learning

Different from supervised learning, which has the correct output, and unsupervised learning, which contains no correct output, the data set of reinforcement learning contains input and some grade of output, and can be formed as (input, some output with a grade).

One good example of reinforcement learning is playing games such as chess and Go. The next move is determined based on feedback from the current state; once the move is made, a grade for that move is observed. The goal of reinforcement learning is to find the sequence of the most useful actions, so the reinforcement learning model can always make the best decision.

## Online vs. Offline

Learning problems can also be categorized by how we train the models.

Online Learning

The data set is given to the algorithm one example at a time. Online learning happens when we have streaming data that the algorithm has to process "on the run".

Offline Learning

Instead of giving the algorithm one example at a time, we have all the data to train the algorithm at the beginning. The Iris Data Set example is offline learning.

Both online and offline learning can be applied to supervised, unsupervised, and reinforcement learning.
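The distinction can be illustrated with something as simple as a mean: the offline version sees the whole data set at once, while the online version updates a running estimate one sample at a time and never stores the stream (a small sketch):

```python
data = [4.0, 7.0, 1.0, 8.0]

# Offline: the whole data set is available up front
offline_mean = sum(data) / len(data)

# Online: process one sample at a time, keeping only a running state
count, online_mean = 0, 0.0
for sample in data:  # imagine these arriving from a stream
    count += 1
    online_mean += (sample - online_mean) / count

print(offline_mean, online_mean)  # both are 5.0
```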

# Challenges in Machine Learning

With the development of machine learning (ML) and artificial intelligence (AI), many AI/ML-powered tools and programs have shown great potential in many fields. However, machine learning is not as powerful as many people think. Instead, it has significant weaknesses that many researchers are eager to solve. The following are some challenges the machine learning industry faces today.

First of all, there is no easy way to transfer a learned model from one learning problem to another. As humans, when we are good at one thing, say playing Go, chances are we are also good at similar games such as chess or bridge, or at least find them easy to learn. Unfortunately, machine learning needs to start the learning process over for every single problem, even if the new problem is similar to one already solved. For example, the Perceptron learning model that we trained to classify iris plants in the previous section can do nothing but classify iris plants. If we want to classify different flowers, we have to train a new learning model from scratch.

The second challenge is that there is no smart way to label data. Machine learning can do many things and can learn from all kinds of data; nevertheless, it does not know how to label data. Yes, we can train a learning model to recognize an animal, say cats, by feeding it many cat pictures. However, in the beginning, someone, a human, needs to label each picture as a cat picture or not. Even if unsupervised learning can help us group data, it still does not know whether a group of pictures shows cats. Today, labeling data still requires human involvement.

Third, even the creator of a machine learning solution does not know why it makes a particular decision. The essence of machine learning is learning from data: we feed data to a learning model, and it predicts the results. In the Perceptron Learning Algorithm example, the weights of the final hypothesis may look like [-4.0, -8.6, 14.2], but it is not easy to explain why the learning model gave us these weights. Even for the simplest learning algorithm, Perceptron, we are not able to explain why; needless to say, it is almost impossible to explain how more sophisticated learning algorithms work. This issue may lead to serious consequences, especially for learning problems with a high cost of error, such as self-driving cars: if things go wrong and we do not know how they went wrong, how do we fix them?

These three challenges are just the tip of the iceberg. In a word, although machine learning seems to have unlimited potential, it still has a long way to go.

As mentioned at the beginning of this article, machine learning is a very hot topic today. There are a lot of books and online resources talking about it. Most of them focus on the usage of libraries, e.g., OpenCV and TensorFlow, and machine learning algorithms such as SVM and neural networks. All of these are good stuff. However, I think that in order to master these tools, a comprehensive understanding of the theory of machine learning is necessary. Therefore, I highly recommend the book Learning From Data. The authors, Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, did a great job of explaining the fundamentals of machine learning in both mathematical and practical ways. Of course, it involves a good deal of mathematics; none of it is hard, but it definitely takes time to understand fully. Once we understand the fundamental concepts of machine learning comprehensively, it is easier to learn advanced topics. This book is highly recommended for people who want to build a solid foundation in machine learning.

# Quick Start to Use Visual Studio Code for C++ Programmers in Linux

[Updated: May 04, 2017]

As a software engineer, a great editor or IDE helps improve productivity. Over the years I have used many IDEs and editors to write code, such as Visual Studio, Vim, Eclipse, NetBeans, and Notepad++. In the Windows environment, Visual Studio has always been my favorite tool. However, I was still looking for a good tool for coding in the Linux environment. Now I have finally found one: Visual Studio Code, released by Microsoft. Visual Studio Code is a lightweight and cross-platform version of Visual Studio. It supports several programming languages and many features that help programmers.

Microsoft has posted a lot of fantastic documentation for Visual Studio Code. However, most of it is written for general use of Visual Studio Code, and it took me a long time to figure out how to configure Visual Studio Code to work with C++ in Linux. Therefore, the purpose of this article is to help C++ programmers, like me, who want to use Visual Studio Code for C++ programming in Linux. This Quick Start aims to provide a step-by-step guideline for C++ programmers who want to spend as little effort as possible to use Visual Studio Code in the Linux environment. It covers using Visual Studio Code to build C++ code with CMake and Make, and using Visual Studio Code to debug C++ code in real time.

This Quick Start uses a simple C++ example to demonstrate how to configure Visual Studio Code.

The Quick Start is based on the following environment:

# Installation of Visual Studio Code

First, go to https://code.visualstudio.com/Download; several options are available. Second, choose the appropriate Visual Studio Code package to download: Visual Studio Code provides a .deb package for Debian and Ubuntu, and a .rpm package for Red Hat, Fedora, and CentOS. Then, install the downloaded package based on your Linux distribution.

• Debian and Ubuntu

In Debian and Ubuntu, use the dpkg command to install Visual Studio Code.

sudo dpkg -i code_1.11.2-1492070517_amd64.deb

This command installs Visual Studio Code into /usr/share/code.

• Red Hat, Fedora, and CentOS

In Red Hat, Fedora, and CentOS, use the rpm command to install Visual Studio Code.

sudo rpm -Uvh code-1.11.2-1492070635.el7.x86_64.rpm

As with dpkg, Visual Studio Code is installed in /usr/share/code.

• Standalone

The standalone binary release is also available.

Download the standalone Visual Studio Code, untar the package, and then, start Visual Studio Code:

tar xvzf code-stable-code_1.11.2-1492070517_amd64.tar.gz
cd VSCode-linux-x64
./bin/code

# A Simple C++ Example

The sample code includes a QuickStart library, which provides a very simple function, add, and a unit test which leverages the Boost Unit Test Framework.

## Sample Code

The sample code has the following hierarchy.

include/QuickStart.hpp

class QuickStart
{
public:
    int add(int a, int b);
};

src/QuickStart.cpp

#include "QuickStart.hpp"

int QuickStart::add(int a, int b)
{
    return (a + b);
}

test/QuickStartTest.cpp

#define BOOST_TEST_MODULE QuickStartTest

#include "QuickStart.hpp"

#include <boost/test/unit_test.hpp>

BOOST_AUTO_TEST_SUITE(QuickStartSuite)

BOOST_AUTO_TEST_CASE(AddTest)
{
    QuickStart quickStart;

    BOOST_CHECK_EQUAL(quickStart.add(1, 2), 3);
}
BOOST_AUTO_TEST_SUITE_END()

# Build with CMake

CMake is a very popular tool for building C++ code; it is designed to build, test, and package software. It supports directory hierarchies and uses a compiler-independent approach. CMake is also much easier to use than Make. There are two CMake extensions available for Visual Studio Code: CMake and CMake Tools.

• CMake provides the language service support for CMake.
• CMake Tools gives the ability to build CMake targets.

To have the best CMake experience, installing both extensions is recommended.

Assume there are CMakeLists.txt files in the QuickStart Project, and the hierarchy is as follows.

Top level CMakeLists.txt

cmake_minimum_required (VERSION 3.5)
project(QuickStart)

set(QUICK_START_VERSION_MAJOR 1)
set(QUICK_START_VERSION_MINOR 0)
set(SOURCES ${PROJECT_SOURCE_DIR}/src/QuickStart.cpp)
include_directories("${PROJECT_SOURCE_DIR}/include")

add_library(QuickStart STATIC ${SOURCES})

enable_testing()
add_subdirectory(test)

test/CMakeLists.txt

find_package(Boost REQUIRED COMPONENTS unit_test_framework)
include_directories("${PROJECT_SOURCE_DIR}/include")

set(TEST_SOURCES ${PROJECT_SOURCE_DIR}/test/QuickStartTest.cpp)
set(TEST_LIBS QuickStart)

add_executable(test_main ${TEST_SOURCES})
target_link_libraries(test_main ${TEST_LIBS} ${Boost_LIBRARIES})

First, press Ctrl+Shift+P to open the "Command Palette" and type cmake; the Command Palette displays a list of available commands that we can execute from here. In this case, it shows all commands related to CMake.

Choose CMake: Configure for the first time. A new list pops up, showing all the available configurations.

Second, after we choose one configuration (e.g. Debug), CMake Tools automatically sets up the configuration for us.

Now, Visual Studio Code is ready to use. Press Ctrl+Shift+P to open the Command Palette, type cmake, and choose CMake: Build.

Visual Studio Code starts building the code.

Choose CMake: Clean.

Visual Studio Code cleans the build folder and built files.

Use CMake: Run tests to run the unit tests.

In addition to the Command Palette, CMake Tools provides a quick and convenient way to trigger the most common actions via the Side Bar, which also shows all the available build targets to switch between.

## Other Configurations

All available CMake configurations can be found in settings.json. To open settings.json, go to [File -> Preferences -> Settings] from the menu bar. Visual Studio Code allows users to modify nearly every part of it. In settings.json, there is a section called CMake Tools configuration, which has many options for users to modify.

Modification of these options is straightforward. One option worth mentioning is "cmake.preferredGenerators."

## Ninja Build

"Ninja is a small build system with a focus on speed." CMake Tools supports several generators, and Ninja is the default option. When a project gets bigger, it takes longer to build, and the Ninja build system performs builds faster than GNU Make. Using Ninja to build the code is therefore highly recommended, i.e. keeping it as the default option of "cmake.preferredGenerators."

# Debug

One important reason to use IDE is its debugging support. Visual Studio Code has great debugging support.

To start debugging, click the debug icon in the Activity Bar.

The very first time, we need to add a configuration.

Then, choose the debugger, C++ (GDB/LLDB), for Linux environment.

Visual Studio Code generates a configuration file, launch.json, for debugging. This configuration file provides a variety of options to modify. The most important option is to tell Visual Studio Code which binary to debug. In this example, we debug the unit test binary, test_main, which is located at ${workspaceRoot}/build/test/.

Now, Visual Studio Code is ready to debug.

# Build with Make

Although CMake is easier to use than GNU Make, GNU Make is still used by many people. This section demonstrates how to configure Visual Studio Code to build the code with GNU Make. Assume there is a Makefile in the QuickStart Project, and the hierarchy is as follows.

Makefile

BUILDDIR = build
INCDIRS = include
SRCDIR = src
TESTDIR = test

OBJS = $(addprefix $(BUILDDIR)/, $(patsubst %.cpp, %.o, $(notdir $(wildcard $(SRCDIR)/*.cpp))))
TESTOBJS = $(addprefix $(BUILDDIR)/, $(patsubst %.cpp, %.o, $(notdir $(wildcard $(TESTDIR)/*.cpp))))

LIBQUICKSTART_FNAME = $(BUILDDIR)/libQuickStart.a
TEST_FNAME = $(BUILDDIR)/test_main

CXXFLAGS += $(INCDIRS:%=-I%) -std=c++1y
LDFLAGS += $(LIBQUICKSTART_FNAME) -lboost_unit_test_framework

.PHONY: all clean

all: $(LIBQUICKSTART_FNAME) $(TEST_FNAME)

$(LIBQUICKSTART_FNAME): $(OBJS)
	$(AR) $(ARFLAGS) $@ $^

$(TEST_FNAME): $(TESTOBJS) $(LIBQUICKSTART_FNAME)
	$(CXX) -o $(TEST_FNAME) $^ $(LDFLAGS)

$(BUILDDIR)/%.o: $(SRCDIR)/%.cpp | $(BUILDDIR)
	$(COMPILE.cc) $< $(OUTPUT_OPTION)
	$(COMPILE.cc) -MM -MP -MT $@ $< -o $(BUILDDIR)/$*.d

$(BUILDDIR)/%.o: $(TESTDIR)/%.cpp | $(BUILDDIR)
	$(COMPILE.cc) $< $(OUTPUT_OPTION)
	$(COMPILE.cc) -MM -MP -MT $@ $< -o $(BUILDDIR)/$*.d

$(BUILDDIR):
	@mkdir $(BUILDDIR)

run_test:
	$(TEST_FNAME)

clean:
	$(RM) $(BUILDDIR)/*.d $(BUILDDIR)/*.o $(LIBQUICKSTART_FNAME) $(TEST_FNAME)

Visual Studio Code provides a feature, Tasks, to integrate with external tools, including Make.

First, press Ctrl+Shift+P to open the "Command Palette" and type tasks; the Command Palette displays a list of available commands that we can execute from here. In this case, it shows all commands related to Tasks.

The very first time, choose Tasks: Configure Task Runner. A new list pops up with all the available configurations.

Second, to use Make, choose Others; Visual Studio Code then generates a configuration file, tasks.json. This file is where users can set up Tasks for external tools.

Third, change "command": "echo" to "command": "make". In the Makefile of the Quick Start example, there are three build targets: all, run_test, and clean, so add the tasks all, clean, and run_test.
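The original snippet is not preserved here, but a tasks.json along these lines would work (a sketch; it follows the old 0.1.0 Task Runner schema that Tasks: Configure Task Runner generates, where each task name is passed to the command as an argument, so the three tasks run `make all`, `make clean`, and `make run_test`):

```json
{
    "version": "0.1.0",
    "command": "make",
    "isShellCommand": true,
    "showOutput": "always",
    "tasks": [
        { "taskName": "all", "isBuildCommand": true },
        { "taskName": "clean" },
        { "taskName": "run_test", "isTestCommand": true }
    ]
}
```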

Now, Visual Studio Code is ready to run the tasks that we just defined. Press Ctrl+Shift+P to open the "Command Palette" and type tasks; it shows all the available Tasks options. Choose Tasks: Run Task.

Now, it shows all three tasks that we just added: all, clean, and run_test.

# Brief Introduction of Data Center Technologies

I have been working in the data center industry for years, and I feel the entry barrier of data center technologies is high. At the beginning of my career, it was a challenge for me to understand the many technologies involved and how they work together without a systematic guideline. After several years of experience, I thought maybe I could do something to help people who have the same problem I used to have, so I started this series on data center technologies. This is the first article of the Data Center Technologies series, and it gives a general idea of data center technologies. I hope that by summarizing my experience, newcomers to this industry, or anyone interested in data center technologies, can benefit.

# What is a Data Center? What does it do?

A data center can be viewed as a facility (or facilities) whose duty is to manage a large amount of data. A data center is logically composed of application software, software infrastructure (system software), and hardware infrastructure, as the picture below shows.

Hardware infrastructure indicates physical components such as servers, storage systems, network components, power supplies, cooling systems, and anything else needed to support data center operations. Software infrastructure, on the other hand, refers to software running on the hardware infrastructure, including low-level programs (e.g. operating systems) that interact with the hardware. Software infrastructure also offers many data center features to serve a large amount of data, such as data protection, backup, and virtualization. Besides, software infrastructure provides management functionality, so IT staff can manage not only the data but also the hardware resources.

The top level of the data center is the application software, which provides services to end users. Application software varies based on usage: the applications can be databases, email services, web applications, and enterprise applications such as ERP and billing systems.

# What Abilities Should a Data Center Have?

Data is the most precious asset in a data center. Data centers require certain abilities to ensure data services work properly, and many technologies are used in data centers to achieve this goal.

## High Availability

Data centers are supposed to run 24/7/365 without interruption. Planned or unplanned downtime can cause serious damage to business users. High availability and disaster recovery are critical capabilities that data centers require to ensure the continuity of business.

The continuity of a data center can be broken for various reasons.

System Failure and Data Center Outage

Murphy's Law states that "anything that can go wrong will go wrong," and it describes the data center case well: any component in a data center can go wrong, including processors, memory, servers, disks, network, and power supply. To minimize the effect of failures, keeping multiple copies of data in different power domains is one common way to protect data. For example, hard drives have a surprisingly high failure rate, based on Backblaze's Hard Drive Report. Even if a storage medium never failed, it would not last forever. When one drive stops working, there is at least one additional copy of its data on another drive. Besides, different power domains can also mean that backups, i.e. copies, are kept in different geographic locations to prevent the loss of an entire data center due to physical disasters such as earthquakes, fire, or flooding.

Virtualization is one key solution to prevent data center outages and to recover from system failures by decoupling software from the physical hardware components. With virtualization, software and services can keep running without interruption when underlying physical components fail. (Refer to Software Defined Storage and Software Defined Data Center.)

Software Bug and Logical Error

Humans make mistakes, software may have bugs, and malware may be injected, which could result in data corruption, service stops, performance drops, or more serious damage. When logical errors happen, "turning back the clock," i.e. reverting to a good version of the software or data, is a solution to return the system to normal. Taking snapshots or full clones periodically is required in order to turn back the clock when an error happens.

### Measuring Availability

Availability indicates how long components, applications, and services function properly over a certain period. Although the ability to recover from a disaster can only be known after a disaster happens, there are ways to measure this ability by either historical record or estimation.

The simplest form of measuring availability is in absolute values (e.g. 990 out of 1000 hours) or in percentage terms (e.g. 99%) by applying the equation:

availability = operating time / (operating time + outage time)

MTBF, MTTR, and MTTF

More accurate measurements can be used: Mean Time Between Failures (MTBF), Mean Time to Recovery (MTTR), and Mean Time to Failure (MTTF).

• MTBF is the average time between two successive failures of a specific component or system.
• MTTR is the average time it takes to recover a component or system after a failure.
• MTTF is the average time between the recovery of a component or system and the next outage.

Based on these measurements, availability can be written as:

availability = MTTF / (MTTF + MTTR)

This equation illustrates that availability increases substantially as MTTR is reduced, i.e. the faster services recover to normal, the higher availability data centers have.

RTO and RPO

RTO and RPO are used to measure the ability of recovery after a disaster.

• The Recovery Time Objective (RTO) indicates the maximum allowable period for resuming services after a failure. For instance, an RTO of 60 minutes means that the maximum restart time is 60 minutes.

• The Recovery Point Objective (RPO) specifies the maximum tolerated level of data loss in a business continuity solution. For example, if the RPO is set to 5 hours, it means the backups must be made at intervals of 5 hours or less.

## High Performance

Data centers are expected to serve a large number of clients and massive amounts of data, and they are built with premium hardware and the latest technologies. As a result, high performance is highly expected in a data center.

## Simplicity (Easy to Use)

A traditional data center may take an IT department hours to days to deploy before it is fully operational. An all-in-one solution that users can just turn on, and that handles anything users throw at it, is in demand. Also, a data center can be huge and complicated; it requires a significant amount of resources to maintain and operate. Using software to automate data center management not only lowers cost but also reduces the risk of mistakes resulting from human operations.

On the other hand, data centers deal with gigantic amounts of data, e.g. millions and millions of files, folders, and user records. When the scale of data becomes huge, it is impractical to manage it by humans alone; simple management tools are desired.

The goal is to make a data center able to manage both its data and itself automatically, i.e. with no IT involvement (virtualization is one solution trying to achieve this goal).

After all, data center management must provide features that simplify management. For instance:

• Trivial to deploy, such as plug-and-play
• Self-healing, such as automatically rebalancing workloads to utilize resources
• Trivial to scale up and scale out, i.e. adding or replacing nodes, drives, or components
• Graphical UI and step-by-step instructions for repairs, such as drive and fan replacement
• Automatic patch maintenance
• Rolling, in-place software upgrades, so no service is interrupted
• Remote management, so IT can manage and monitor remotely

## Scalability in both Performance and Capacity

Resources are limited, but the speed of data growth increases every year. Traditionally, an IT department estimated the data growth for the next few years, purchased data center equipment based on the estimation, and then hoped it would be enough. Unfortunately, this method does not work in an era when the data growth rate is so fast; purchasing data center components a few years in advance is not practical. In addition to capacity, performance also requires upgrades when its limitation is exceeded. When the systems in a data center need to expand, there are normally two ways to do it: scale up and scale out.

### Scale Out and Scale Up

Scale up refers to systems that expand capacity by adding more drives and increase computation power by adding more processors or more memory.

Scale up has some advantages: there is no network (communication) cost, it is cheap (only the needed components are bought instead of a whole node), and scale-up solutions are easy to design.

However, scale up also has disadvantages. The major one is the scale limitation: a single server has limited space, e.g. a limited number of PCIe slots and RAID controllers, and scale up cannot go beyond that limitation. Second, adding only one type of component can cause performance imbalance. For instance, if many drives are added but the processors and memory remain the same, the processors and memory can become the new bottleneck.

Scale out refers to systems that do not expand by adding individual components such as memory, drives, and processors. Instead, scale-out systems expand by adding new nodes (a new node could be a server which includes processors, memory, and drives).

Scale out is the direction data center technologies are currently heading. It solves the scale limitation issue that scale up has. Besides, a scale-out architecture can be designed to support hundreds or even more nodes, generating millions of IOPS and providing petabytes of capacity. The major challenge of the scale-out solution is that it is difficult to design: it requires heavy network traffic for communication and needs to solve load balancing and migration problems. These issues are what scale-out solutions such as Software Defined Storage and Software Defined Data Center aim to solve.

## Ability to Provide Data Insight and Data Analytics

No doubt, data is valuable. However, without data analysis, data itself is meaningless and useless. An ideal data center should be data-aware and provide data analysis to help enterprises discover data insights and generate business value. With visibility into system behavior and system-level real-time data analytics, we can understand what is going on inside the system. It also creates chances to improve overall system performance, such as better usage of files and storage, and to locate the root cause of issues easily. Discovering hidden value from data is in demand, but the bottom line is that data centers should, at least, be able to collect and provide data.

# Storage Technologies Overview

In today's big data era, data is growing at unprecedented speed, and data is an enterprise's most valuable asset. Therefore, storage systems take on the heavy responsibility of storing, protecting, and managing data. This section covers basic storage system concepts and technologies, including architectures of storage systems, the connectivity for storing data, and presentations of storage systems.

## Storage System Architectures

Storage systems can be developed in the following architectures: Storage Area Network (SAN), Network Attached Storage (NAS), and Direct Attached Storage (DAS).

### Storage Area Network (SAN)

SAN is a dedicated network that allows users to share storage. A SAN consists of storage devices; interconnecting network components such as switches, routers, and adapters; and host servers connected to this network.

The characteristics of SAN can be summarized as follows:

• Dedicated network; not accessible by other devices through the LAN
• Uses protocols including Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), and Internet Small Computer Systems Interface (iSCSI)
• Provides block level access only, but file systems can be built on top of a SAN
• Additional capabilities include disk zoning, disk mapping, LUN masking, and fault management

### Network Attached Storage (NAS)

NAS is a file level storage device that connects to a network and is accessed using file sharing protocols.

NAS has the following characteristics:

• File level access storage architecture with storage attached directly to the LAN
• Supports Ethernet connections and allows administrators to manage disk space, set disk quotas, provide security, and utilize snapshots
• Supports multiple file-level protocols: NFS, SMB/CIFS

### Direct Attached Storage (DAS)

DAS is a block level device which is directly attached to a host machine. The storage can be internal, or external disk enclosures, i.e. JBOD and RAID, with interfaces such as SATA, eSATA, SAS, and FC. No network is required.

## Presentations of Storage

To allow hosts, OSes, and applications to communicate with storage systems over a network, they must agree on how they communicate. Storage systems provide three types of interfaces for communication: block, file, and object. If a storage system provides a block interface, it is usually called a block storage system. Similarly, if a storage system offers a file interface, it is called a file storage system; if a storage system has an object interface, it is called an object storage system. However, a storage system can present more than one interface. The picture below shows the three types of storage presentation.

### Block

A block device, e.g. a hard drive, allows data to be read or written only in blocks of a fixed size (e.g. 512 bytes or 4KB). Block storage presents storage as logical block devices by using industry-standard iSCSI or FC connectivity mechanisms. In a Storage Area Network, the SAN uses a LUN (Logical Unit Number) to represent a logical abstraction (virtualization) layer between the physical disk devices and the host machines. The host OS sees the logical unit as a block device that can be formatted with a file system. A block device requires a block driver (in the host machine) that offers the kernel a block-oriented interface, which is invisible to the user or to applications opening the /dev (in Linux) entry points. Block storage allows for manipulation of data at the block level.

### File

As its name implies, file-based storage systems provide shared access to entire file systems down to the individual file. Files are structured in a filesystem and organized hierarchically, so that each file can be located by its path. Besides, each file has metadata containing attributes associated with it, including owner, permissions, group, and size. The filesystems are network attached and require file level protocols such as NFS (Linux) and SMB (Windows). In file storage systems, data is written into and read from variable-length files. A file storage system can be built on top of a block storage system.

Usually, file-based storage has the following characteristics:

• Standard file system interface
• Supports the POSIX interface
• Location and path aware
• Targets structured data sets

### Object

An object storage system provides access to whole objects, or blobs of data, through an API specific to the system, e.g. a RESTful API. Each object usually includes data, metadata for internal management, and a UUID: the data can be anything, e.g. photos, videos, or PDF files; the metadata contains information describing what the data is about; and the unique ID is assigned to an object when the object is created, so the object can be retrieved by its ID. Cloud-based storage systems such as Amazon S3 and OpenStack Swift are known for offering object-based storage access. There is no limit on the type or amount of metadata, which makes object storage flexible and powerful: anything can be included in the metadata.

Object-based storage has the following characteristics:

• Targets cold data, e.g. backup, read-only, and archive data
• Targets unstructured data sets, because anything can be included in the metadata
• Write-once (immutable)
• Data access through a RESTful API, e.g. S3 and Swift
• Immutable objects and versioning
• Location unknown

## Storage Media

Several types of non-volatile media are used in data centers to persist data. The table below shows a brief comparison of four popular storage media.

| | Capacity | Latency | Throughput |
| --- | --- | --- | --- |
| HDD (SATA/SAS) | Large | High | Low |
| SSD (SATA/SAS) | Medium | Medium | Medium |
| PCIe SSD (AHCI/NVMe) | Medium | Low | High |
| NVDIMM | Small | Very Low | Very High |

HDDs have lasted for decades; they offer huge capacity at a low price, but their latency and IOPS are relatively bad. SSDs (flash) are taking over the storage medium market share as capacity increases and prices decline, and SSDs offer much better performance than HDDs.

NVDIMM stands for Non-Volatile Dual In-line Memory Module. It is the next generation of storage medium; it has extremely fast performance at a very expensive cost.

## Summary

The picture below illustrates the layout of storage system architectures.

The table below summarizes the difference between NAS, SAN, and DAS.

| | NAS | SAN | DAS |
| --- | --- | --- | --- |
| Presentation | File, Object | Block | Block, File |
| Protocol | Ethernet | FC, FCoE, iSCSI | SAS, SCSI, FC |
| Separate Network | No | Yes | No |

The table below compares the different presentations of storage systems.

| | File | Object | Block |
| --- | --- | --- | --- |
| Use cases | Structured data sets; data with heavy read/write | Unstructured data sets; large data sets; immutable data; backup data; archive data; read-only data | Virtual machine; database; email server; RAID |
| Data size | Variable size | Variable size | Fixed size |
| Identified by | File name and file path | Unique ID | Address (LBA) |
| Composed of | File and metadata | Data, metadata, unique ID | Fixed-size raw data |
| Access through | Standard file access, POSIX interface | REST API | Block device |
| Data granularity | Directory, mount point | Namespace, object | LUN |
| Performance (normal case) | Medium to high | Low to medium | High |
| Data organization | Hierarchy | Flat namespace | |

# Software Defined Storage and Software Defined Data Center

Software Defined Storage (SDS) is an approach to data storage in which software provides abstraction layers that decouple the storage from the hardware. SDS aims to run on any commodity hardware and to be a solution for scale-out storage systems.

Software-Defined Data Center (SDDC) is a concept in which the data center infrastructure is virtualized and presented as software functions. SDDC usually includes Software Defined Networking (SDN), which manages network traffic; Software Defined Compute (SDC), which manages workloads; and SDS, which manages data. The combination of SDN, SDC, and SDS is also referred to as Software Defined Infrastructure (SDI), an important component in a cloud environment.

This section mainly focuses on SDS.

## Why Software Defined (Virtualization)?

The key to SDS and SDDC is virtualization. Virtualization means using software to provide a computer-generated simulation of real hardware resources. The basic idea is to use software to manage the hardware resources, so clients or users see only the software layer without knowing the underlying hardware.

Virtualization has many benefits and provides many abilities that traditional storage systems do not have.

• High Availability: with the separation of software and hardware that virtualization provides, the goal of high availability can be achieved. The virtualization layer hides the underlying hardware components, so software running on top of it does not stop when an underlying hardware component fails, e.g. a drive failure or node failure. Although performance degradation may be observed, the overall service continues working; therefore, an ideal virtualized environment can accomplish the zero-downtime requirement. Besides, the virtualization layer also replicates data to separate power domains to avoid a single point of failure: even if a failure occurs, there are replicas available to recover from.

• Hardware Flexibility: because virtualization separates the underlying hardware from software, the software can run on any commodity hardware. Using commodity hardware to build a data center usually means lower cost. Besides, hardware upgrades do not affect the software services.
• Scale Out: when hardware resource usage approaches its limit, virtualization affords a way to add new nodes, drives, or other hardware resources without stopping services; virtualization makes scaling out easier. Besides, IT departments no longer need to purchase extra hardware components (especially drives) ahead of time. Thin provisioning is one method that uses virtualization technologies to optimize the utilization of available storage, so IT departments can buy drives when they actually need them.
• Simplified Management: because the system is software-defined, the management UI can be designed to be as simple as possible. The software can provide policy-based management regardless of the hardware being used, and management automation is also feasible with policy-based management. Therefore, the overall IT effort required can be reduced, and IT operations can be simplified.
• Advanced Features: virtualization usually comes with advanced features such as snapshot, compression, data deduplication, encryption, and intelligence.
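As a rough illustration of how a virtualization layer can hide hardware failures, the hypothetical sketch below mirrors every write across two backing "devices" (plain dicts) and keeps serving reads when one of them fails. It also allocates blocks lazily on first write, a crude form of thin provisioning; real SDS products are of course far more sophisticated:

```python
class VirtualVolume:
    """A toy software-defined volume that mirrors writes across replicas.

    Reads succeed as long as at least one replica is healthy, which is
    the basic idea behind replication-based high availability.
    """

    def __init__(self, replica_count=2):
        # Each "device" is just a dict; blocks are allocated lazily
        # on first write rather than pre-provisioned.
        self.replicas = [{} for _ in range(replica_count)]
        self.failed = set()

    def write(self, block_id, data):
        # Mirror the write to every healthy replica.
        for i, replica in enumerate(self.replicas):
            if i not in self.failed:
                replica[block_id] = data

    def read(self, block_id):
        # Serve the read from any healthy replica that holds the block.
        for i, replica in enumerate(self.replicas):
            if i not in self.failed and block_id in replica:
                return replica[block_id]
        raise IOError("all replicas of block %s are unavailable" % block_id)

    def fail_device(self, index):
        # Simulate a drive or node failure in one power domain.
        self.failed.add(index)


volume = VirtualVolume()
volume.write(0, b"hello")
volume.fail_device(0)     # lose one replica...
print(volume.read(0))     # ...the data is still readable: b'hello'
```

The client of `VirtualVolume` never sees which dict served the read, just as users of an SDS layer never see which physical drive holds their data.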

# Cloud and Edge Computing

The Cloud (or Cloud Computing) is the delivery of on-demand computing resources over the network. In other words, a cloud is a logical group of many servers, storage systems, networking components, and software connected to the internet.

## Types of Cloud Services

Cloud is a service-oriented model – "everything as a service" (XaaS). Based on different requirements, there are many types of services. The three most common are Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

• Infrastructure as a Service (IaaS)

In the IaaS model, the service providers deliver raw, unconfigured resources. The service providers may provide tools to manage the resources, but the users are the ones who configure the resources and install the software.

Amazon EC2 is an example of IaaS.

• Platform as a Service (PaaS)

The PaaS model usually provides a ready environment that comes with a pre-configured setup (e.g., LAMP or MEAN). In this model, the users can focus on their main business without spending time configuring their environment.

Microsoft Azure, Amazon AWS, and Google Cloud Platform are examples of PaaS.

• Software as a Service (SaaS)

SaaS is software that users can access and use without downloading and installing it. Many online services people use daily are SaaS, for instance, Microsoft Office 365, Gmail, Xbox Live, and Google Docs.

## Deployment Models

The cloud can be categorized as follows: Public, Private, and Hybrid.

• Public

Public cloud means that the cloud providers own and maintain the resources and offer the services for public use. In other words, the public cloud is meant to be used by anybody, and it can be free or paid. Microsoft Azure and AWS are the two biggest public cloud providers.

• Private

Private cloud is privately owned by a single organization for its internal use. It means only this organization or its customers can use the cloud resources. There is no architectural difference between a public and a private cloud.

• Hybrid

Hybrid cloud is the combination of the public and private cloud: an organization uses the public cloud for some workloads while keeping an on-prem private cloud for exclusive usage.

## Why and why not (Public) Cloud?

One major benefit of using the cloud is being hardware-free: the users need neither to purchase any hardware nor to maintain the physical components. In many cases, using the cloud also means lower costs than managing a dedicated data center, i.e., a private cloud. Using the cloud eliminates the costs of building and operating a data center and reduces the costs of IT departments. However, in some other cases, using the cloud does not save money. For big corporations that have heavy network traffic and huge storage demand, chances are the charge for using the public cloud is more than the cost of building a dedicated data center.
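The cost trade-off can be made concrete with a back-of-the-envelope calculation. The numbers below are purely hypothetical placeholders, not real provider pricing; the point is only that monthly cloud charges scale with usage while a dedicated data center is largely a fixed cost:

```python
# Hypothetical monthly costs (illustrative numbers only, not real pricing).
cloud_cost_per_tb = 25.0          # $/TB-month for storage plus traffic
datacenter_fixed_cost = 50_000.0  # $/month: hardware amortization, power, staff

def monthly_cloud_cost(storage_tb):
    # Cloud spend grows linearly with usage.
    return cloud_cost_per_tb * storage_tb

# Below the break-even point the public cloud is cheaper;
# above it, a dedicated data center wins.
break_even_tb = datacenter_fixed_cost / cloud_cost_per_tb
print(break_even_tb)                                       # 2000.0
print(monthly_cloud_cost(100) < datacenter_fixed_cost)     # True: small user
print(monthly_cloud_cost(10_000) > datacenter_fixed_cost)  # True: big corporation
```

With these made-up numbers, an organization storing a few hundred TB clearly benefits from the cloud, while one storing tens of PB is better off with its own data center, which matches the observation above.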

Although it seems everything is moving to the cloud, in some cases, people move to the cloud slowly or refuse to do so. The main reasons are security concerns and loss of control. When using the public cloud, data is stored in the cloud providers' data centers. If the cloud is out of service (e.g., AWS has an outage), the users can do nothing but wait until it is recovered. Besides, holding data somewhere else cannot be tolerated by customers who have very sensitive data. For instance, the health care and financial industries hold sensitive information, and the entertainment industry is very sensitive to leaks. The other reason not to use the public cloud is network connectivity: in areas that do not have adequate network bandwidth or stable network connectivity, users do not gain benefits from using the public cloud.

## What is Edge Computing?

In contrast to cloud computing, edge computing pushes computing power to the edges of a network. Edge devices, i.e., Internet of Things (IoT) devices, have limited computing power and storage space. Therefore, the most common model today is that IoT devices are only responsible for collecting data and sending the data back to the cloud. Then, compute-intensive tasks, e.g., data analytics and prediction, occur in the cloud. A typical example is shown in the picture below.

Nevertheless, in some situations, the computation must take place at the edge devices: sending data back to the cloud and waiting for the computing results introduces too much latency. For example, an auto-driving car needs to respond within a very short period under any traffic conditions; if it cannot respond in time, accidents may happen.
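The split between edge and cloud can be sketched as a latency-based decision: time-critical events are handled on the device itself, while the raw data is still batched for later cloud-side analytics. Everything below (the sensor reading, the threshold, and the batching scheme) is a hypothetical illustration:

```python
CRITICAL_DISTANCE_M = 5.0   # react locally below this distance (made-up threshold)

cloud_batch = []            # non-urgent data, uploaded to the cloud later

def handle_reading(obstacle_distance_m):
    """Decide on the edge device whether to act immediately.

    A self-driving car cannot wait for a cloud round trip when an
    obstacle is close, so the braking decision is made locally; the
    raw reading is still queued for cloud-side analytics.
    """
    cloud_batch.append(obstacle_distance_m)
    if obstacle_distance_m < CRITICAL_DISTANCE_M:
        return "brake"       # computed at the edge, no network latency
    return "continue"

print(handle_reading(2.0))   # brake
print(handle_reading(50.0))  # continue
```

The latency-critical path never touches the network; only the accumulated batch does, which is exactly the division of labor described above.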

With the growth of IoT, low-cost and power-saving processors, and increasingly available storage, the prospect of edge computing is that more computation will be pushed to the edge. Eventually, direct communication can happen between devices, and every use case becomes possible.

# Trending

It is the beginning of 2017, and data center technologies keep evolving. Expected developments include:

## Solid-State Drive (SSD) is taking over Hard-Disk Drive

With the decrease in price and the increase in capacity, performance, and reliability of SSDs, SSDs are taking over HDDs in both the enterprise storage and personal computer markets. First, SSD prices are closing in on HDD prices very quickly: HDD prices are dropping only slightly, and an HDD still has a minimum cost to be built. Therefore, price is no longer an obvious reason to choose HDD over SSD. Second, HDD performance is limited by physics. An HDD is a mechanical spinning drive, and with that physical limitation, it is getting harder and harder to increase its performance. In short, SSDs will soon become the main non-volatile storage media in data centers.

One opportunity here is that many existing storage systems were designed for HDDs, so they are not optimized for SSDs. When these systems switch from HDD to SSD, they do not gain a significant performance increase. This fact opens a new opportunity for vendors to build new storage systems that utilize the full power of SSDs.

## Cloud Computing and Edge Computing are not competing against each other

Amazon AWS, Microsoft Azure, and Google Cloud Platform continue leading and earning cloud market share, and many customers have started moving to cloud solutions. On the other hand, the IoT ecosystem is also growing very fast. With more powerful SoCs, processors, and other components, IoT devices can do more compute-intensive tasks at the edge. As mentioned in "What is Edge Computing?", in some cases it is critical to do the computation jobs, e.g., analytics and prediction, at the edge. Cloud and edge are not taking over each other. Instead, the cloud and IoT devices coordinate to build a more mature and comprehensive ecosystem.

## Data analytics and intelligence are important

Data matters; even machine-generated data matters. Big data, data analytics, and machine learning are buzzwords that everybody is talking about nowadays. However, most discussions focus on human-generated data, and this area will remain hot and important. On the other hand, people are realizing that machine-generated data has value as well. Providing comprehensive and abundant system-level data helps people remove performance bottlenecks, increase utilization, locate the root of issues, and discover hidden value that they did not know about before. Besides, with the power of machine learning, it is possible to take one more step and build data centers that can manage themselves. Soon, more system-level data will be pulled out and analyzed, and more intelligence will be built into data center technologies.