Supervised Machine Learning
Supervised learning
In this document, I will not go deep into the concepts and the different algorithms used in supervised learning; instead, I will try to explain them at a level a novice can understand.
Supervised learning is one of the approaches one can use in machine learning. Some may say it is the easier approach compared to its counterparts, such as unsupervised learning. Supervised learning works on the principle of having training data in which each instance has an input (a set of attributes) and a desired output (a target class). We then use this data to train a model that predicts the target class for new, unseen instances. In short, supervised learning occurs when the learning data contains the “right answers”.
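To make the idea concrete, here is a minimal, made-up training set where each instance pairs an input with its “right answer”; the attribute names and values are purely illustrative.

```python
# A tiny, hypothetical training set: each instance pairs an input
# (a set of attributes) with the desired output (a target class).
# Attributes here are [age, blood_pressure]; the "right answer" says
# whether the patient is at risk (1) or not (0).
X = [
    [25, 110],
    [47, 140],
    [52, 160],
    [33, 120],
]
y = [0, 1, 1, 0]  # the known target class for each instance

# A model trained on (X, y) can then predict y for new, unseen inputs.
for attributes, target in zip(X, y):
    print(attributes, "->", target)
```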
There is a wide range of supervised learning algorithms, from the simple Naïve Bayes and K-Nearest Neighbors to advanced linear classifiers such as Support Vector Machines (SVMs). Some methods, such as decision trees, let us visualize how important a feature is for discriminating between the different target classes and give a human-readable interpretation of the decision process, but most if not all of them follow the same basic principle mentioned above.
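As a sketch of how a decision tree exposes feature importance, the snippet below fits one to the well-known Iris dataset (assuming scikit-learn is installed; the dataset choice is mine, not the author's):

```python
# A decision tree reports how much each feature contributed to
# separating the target classes via feature_importances_.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# feature_importances_ sums to 1; larger values mean the feature did
# more of the work in discriminating between the classes.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```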
Regression is another family of methods, one that tries to predict real-valued data. We could consider regression as classification with an infinite number of target classes. For example, predicting a blood sugar level is a regression task, while predicting whether somebody has diabetes is a classification task.
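The contrast can be sketched in a few lines of scikit-learn; the data below is invented for illustration, and the choice of linear and logistic regression as the two models is my own.

```python
# Regression predicts a real value; classification predicts a class.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[20], [30], [40], [50], [60]]   # e.g. patient age (made up)
sugar = [4.8, 5.1, 5.6, 6.4, 7.2]    # real-valued target -> regression
diabetic = [0, 0, 0, 1, 1]           # class label -> classification

reg = LinearRegression().fit(X, sugar)       # predicts a real number
clf = LogisticRegression().fit(X, diabetic)  # predicts a class (0 or 1)

print("predicted sugar level:", reg.predict([[45]]))
print("predicted diabetic?  :", clf.predict([[45]]))
```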
Most of the methods mentioned above are included in the Python scikit-learn package, which makes them easy to implement.
Picking one of the algorithms mentioned above, I will try to explain, as simply as possible, the programming steps that take us from training data to predictions on new, unseen instances, or what many would call test data.
Below are the steps we go through to carry out predictions:
- Import important packages. Python has very big support for machine learning, and with its rich libraries one can harness its computational prowess. NumPy, pandas and scikit-learn are just a few of the most important packages one can use for a successful machine learning project.
- Read the training data file. The pandas package is first put to work at this stage. We use it to read our training data, which most times comes in a comma-separated (.csv) or spreadsheet format. The loaded file is organised and easy for the machine to read, and further modifications are done hereafter.
- Replace all missing fields. Missing data in fields can make the algorithm perform badly, so these fields are replaced with average values or, in extreme cases, the variables are dropped. This is called data munging and is done in pandas.
- Drop non-numeric fields. Statistical analysis works with numeric values, not string values, so removing the fields that bear non-numeric values avoids the likelihood of errors. I personally never saw a mathematical formula that took a non-numeric value and produced a numeric result. These fields are excluded from the statistical analysis. Note: even the expected classes are represented with numeric values.
- Define features (X) and labels (y). The features or attributes are all the numeric fields except the class field. These are plotted in multi-dimensional space to come up with the right classifications. This part of the data is the backbone of the analysis; without it we would have no basis for classification. The labels or classes, on the other hand, are the desired output or target.
- Define train and test data. Train data is a subset of the original sample, represented by the selected attributes and their respective target values. We always want our training data to be a representative sample of the population it represents. The test data is used to show how the algorithm behaves on instances it has not seen. To get train and test data from the dataset, we shuffle and split the data, often with cross-validation: cross-validation avoids relying on one particular split, reducing the variance of the result and producing a more realistic score for our models.
- Define the classifier. This is the stage where the machine learning algorithm is called. The scikit-learn package has a number of these algorithms, which carry out both regression and classification; as mentioned above, they include K-Nearest Neighbors and SVM. With the classifier defined, we can fit the model to the train data to do the learning job. Score is to test data as fit is to train data, and we can select the accuracy measurement by passing any scorer function as an argument.
- Make predictions. This is the point where we bring real-world data to the algorithm to make predictions.
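The steps above can be sketched end to end. The file contents, column names and the choice of K-Nearest Neighbors are all assumptions made for illustration (the CSV is inlined via `io.StringIO` so the sketch runs without a file on disk); it assumes pandas and scikit-learn are installed.

```python
import io
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1-2. Import packages and read the training data (a tiny, made-up CSV).
csv = io.StringIO(
    "age,income,city,target\n"
    "25,30000,Kampala,0\n"
    "47,82000,Nairobi,1\n"
    "52,,Kigali,1\n"
    "33,45000,Kampala,0\n"
    "41,61000,Nairobi,1\n"
    "29,38000,Kigali,0\n"
)
df = pd.read_csv(csv)

# 3. Replace missing fields with the column average (data munging).
df["income"] = df["income"].fillna(df["income"].mean())

# 4. Drop non-numeric fields before the statistical analysis.
df = df.drop(columns=["city"])

# 5. Define features (X) and labels (y).
X = df.drop(columns=["target"])
y = df["target"]

# 6. Split into train and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# 7. Define the classifier and fit it to the train data.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# 8. Score on the test data and predict a new, unseen instance.
print("test accuracy:", clf.score(X_test, y_test))
new = pd.DataFrame([[30, 40000]], columns=["age", "income"])
print("prediction:", clf.predict(new))
```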
Evaluating our results
The final step in every supervised learning task should be to evaluate our best classifier on previously unseen data, to get an idea of its prediction performance. How right are the results of our classifier, one would ask? We calculate the accuracy simply as the proportion of times our method correctly predicted the class of the left-out instances.
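A sketch of this evaluation, again using Iris as a stand-in dataset and K-Nearest Neighbors as a stand-in classifier (both my assumptions, not the author's choices):

```python
# Accuracy = fraction of held-out instances whose class was predicted
# correctly. Cross-validation repeats the split to reduce variance.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = KNeighborsClassifier().fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"held-out accuracy: {acc:.3f}")

# Five-fold cross-validation gives a more realistic score.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print("cv scores:", scores.round(3))
```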
Machine Learning at Rainbow
With the brief preamble above on supervised machine learning, identifying anomalies in our system, and components that might be subject to failure, becomes much easier.
The opportunities for machine learning in payments are almost limitless. Some examples are listed below.
- Transaction risk management. Use supervised machine learning algorithms to identify the risk of payment transactions.
- Merchant risk analysis. Use machine learning to assess an acquirer's risk in signing a new merchant or in managing the risk of ongoing merchant relationships.
- Optimizing the user's web experience. For example, recommending sites for customer check-out.
- Customer classification. Use supervised machine learning to group customers based on a set of customer characteristics.