Machine learning

Author: Jerome Cheng, M.D. (see Authors page)
Deputy Editor in Chief: Debra Zynger, M.D.

Revised: 4 June 2018, last major update May 2018

Copyright: (c) 2003-2018,, Inc.

PubMed Search: Machine informatics [title]

Cite this page: Cheng, J. Machine learning. website. Accessed June 20th, 2018.
Definition / general
  • Science of using computer algorithms to learn from patterns present in a dataset and to make predictions based on the learned patterns
Essential features
  • A machine learning algorithm is used to create a model from a dataset; the model is then used to make predictions
  • With the proliferation of open source tools, machine learning is now more accessible than before
  • A machine learning model can be built without any knowledge of computer programming; for beginners, Orange software is a good starting point (Orange - Interactive data analysis [Accessed 7 May 2018])
  • Applications in pathology / laboratory medicine include molecular subtyping of cancer, image recognition / segmentation and identification of lesions in digital slides
  • Publicly available datasets applicable to machine learning can be found on the Internet
  • Target / outcome variable is the value predicted based on the values of other variables in a dataset; it is analogous to the dependent variable in statistics
    • Features that contribute towards making the prediction are equivalent to the independent variables in statistics
    • Terms "feature" and "variable" are used interchangeably
  • Supervised machine learning: relationships between the feature variables and a known target variable are learned from a training dataset; predictions are then made on a second dataset in which the target variable is unknown while the other variables are provided
  • Unsupervised machine learning: no target variable is provided; data are assigned to different groups (clusters) based on the patterns and relationships found by the algorithm
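  • The difference between the two approaches can be sketched in Python; this is a minimal illustration that assumes scikit-learn is installed and uses its bundled iris dataset, with logistic regression and k-means clustering chosen here only as examples:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Features (X) and target variable (y) from a small bundled dataset
X, y = load_iris(return_X_y=True)

# Supervised: learn from records whose target is known, then predict
# the target for records whose target is withheld
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
supervised_accuracy = (predicted == y_test).mean()
print("Supervised accuracy on withheld records:", supervised_accuracy)

# Unsupervised: the target is never shown to the algorithm; records are
# grouped only by the similarity of their feature values
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = clusterer.fit_predict(X)
print("Cluster assigned to the first five records:", cluster_labels[:5])
```

  • The cluster numbers produced by the unsupervised step are arbitrary labels; they are not guaranteed to line up with the numbering of the known classes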
Machine learning platforms
Machine learning algorithms
  • Linear regression
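    • Given a set of paired x and y values, it finds the best fit line, i.e. the straight line that minimizes the sum of squared vertical distances between the points and the line; the line does not have to pass through any individual point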
    • Used in laboratories to validate a new method for a particular test by comparing the results between the new method and the reference method
    • Various software tools such as R and Python with scikit-learn simplify the process by performing all the necessary calculations
  • Logistic regression
    • Statistical method for solving classification problems
    • An equation models the relationship between a categorical target variable and one or more feature variables
    • Predicts the probability of an outcome occurring
  • Naïve Bayes
    • Based on Bayes' theorem, with the assumption that the variables involved contribute independently to the outcome variable; this assumption may be wrong, hence the description "naïve"
  • Decision trees
    • A decision tree is constructed to fit the dataset, typically by repeatedly choosing the split that best separates the values of the target variable
    • Each node in the tree consists of a feature variable and each node splits into branches based on the value of the variable
    • End of a branch is the value predicted for the target variable
    • Outside of machine learning, decision trees are often used in diagnosis and treatment guidelines
  • Random forest
    • Popular and versatile machine learning method
    • Results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables
  • Support vector machines
    • A line, plane or hyperplane is chosen to separate the points in a dataset into classes, with the widest possible margin between the classes
  • Neural networks
    • Inspired by interconnections between neurons in biological neural networks
  • Convolutional neural network
    • A type of neural network used in image classification and natural language processing
    • Can be used for classifying histology images into different subtypes (e.g. cancer subtype or benign vs. malignant)
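  • Several of the algorithms above can be tried side by side in Python; the following is a minimal sketch that assumes scikit-learn is installed and uses its bundled breast cancer dataset (features computed from images of breast mass aspirates, target of benign vs. malignant); the dataset, the parameter values and the use of feature scaling are illustrative choices, and convolutional neural networks are left out because scikit-learn does not provide them:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Bundled dataset: numeric features per specimen, target is benign vs. malignant
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Models that are sensitive to feature scale are wrapped with a scaler
models = {
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Support vector machine": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "Neural network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
}

# Train each model on the training set and score it on the held-out test set
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracy[name]:.3f}")

# Logistic regression predicts the probability of each outcome, not just a class
probabilities = models["Logistic regression"].predict_proba(X_test[:3])
print(probabilities)
```

  • Every model is trained and scored through the same fit and score calls, so swapping one algorithm for another only requires changing the entry in the dictionary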
Steps in building a machine learning model
  • Collection of data
    • Data can come from a variety of sources including questionnaires, internet searches, databases and images
  • Preparation of data
    • The CSV (comma separated values) file format, as well as other spreadsheet style formats, is commonly used for training machine learning models; each column represents a feature and each row represents a record
    • Downloading a dataset from a machine learning data repository (e.g. UCI Machine Learning Repository) can replace the first 2 steps
  • Choose a programming language or machine learning platform
    • Python and R are currently the most popular programming languages used in machine learning; the scikit-learn library is widely used for performing machine learning tasks in Python; the easiest way to install Python along with the scikit-learn package is by installing Anaconda (Downloads - Anaconda [Accessed 2 May 2018])
    • Orange and KNIME are open source GUI (graphical user interface) based machine learning platforms; these are easy to use and require no programming experience
  • Choose a machine learning algorithm
  • Set the parameters for the model
  • Train the model
  • Test the performance of the model
    • K-fold cross validation: entire dataset is subdivided into K subsets; each subset acts as the validation / test set once and the rest of the data are used for training the model; model performance is assessed by averaging the results attained from each subset
    • Leave-one-out cross-validation: model is trained using the entire dataset except for one data point and the model is validated with the data point that was left out; process is repeated until every data point has been used as the test data point
    • Random stratification: a set percentage of the dataset is randomly assigned to the training set and the remainder becomes the test set
  • Optimize the model
    • Fine tune the model parameters
    • Add new data to the dataset
    • Change the features in the dataset
    • Add additional features
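  • The testing and optimization steps above can be sketched in Python; this is a minimal illustration that assumes scikit-learn is installed, with the iris dataset, the random forest model and the parameter grid chosen here only as examples:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    LeaveOneOut,
    cross_val_score,
    train_test_split,
)

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=20, random_state=0)

# K-fold: each of the K subsets serves once as the test set; scores are averaged
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
k_fold_scores = cross_val_score(model, X, y, cv=k_fold)
print("5-fold mean accuracy:", k_fold_scores.mean())

# Leave-one-out: one record is held out per round, so there is one round per record
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("Leave-one-out mean accuracy:", loo_scores.mean())

# Random split: 70% of records go to the training set, the rest to the test set;
# stratify=y keeps the class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)
model.fit(X_train, y_train)
split_score = model.score(X_test, y_test)
print("Accuracy on the random test split:", split_score)

# Optimization: try several parameter settings on the training set and keep
# the combination with the best cross-validated accuracy
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, None]},
    cv=5,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
```

  • Tuning is done on the training set only, so the test set remains an untouched check of the final model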
Making your first model
  • Creating machine learning models with GUI based platforms like Orange is quick and easy
  • Download and install Orange through the following link: (Orange - Interactive data analysis [Accessed 7 May 2018])
  • Launch the program and a "Welcome" screen will appear
  • Select the "New" option
  • In Orange, various tasks are done through "Widgets", which are represented by various icons on the left side of the user interface
  • Click on the "Datasets" widget found on the left side of the screen and drag it into the canvas (the empty portion of the screen on the right side); double click on it and choose a dataset from the list that appears; click on the "Send Data" button
  • Add the "Test & Score" widget to the canvas
  • Add the "Random Forest" widget to the canvas
  • A different machine learning algorithm widget can be used in place of "Random Forest"
  • Connect the "Test & Score" widget to the "Datasets" widget by clicking on it and connecting the line that appears to the "Datasets" widget
  • Connect the "Random Forest" widget to "Test & Score"
  • Canvas should look something like:
    [Image missing: Orange canvas with the connected widgets, by Jerome Cheng, M.D.]
  • Double click on "Test & Score" and the performance of the newly built model will be displayed
Clinical applications
Board review question #1
    Which machine learning algorithm predicts an outcome by combining results from multiple decision trees?

  A. Linear regression
  B. Neural networks
  C. Random forest
  D. Support vector machines
Board review answer #1
C. In random forest, results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables.
Board review question #2
    Which of the following looks for the best fitting straight line that goes through a set of X and Y points?

  A. Convolutional neural networks
  B. Linear regression
  C. Neural networks
  D. Random forest
Board review answer #2
B. In linear regression, an entire set of X and Y points is used to arrive at the linear equation y = bx + a, where b is the slope and a is the y-intercept (a constant). Linear regression can be used for validation of laboratory test results.
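
A method comparison of this kind can be sketched in Python with scikit-learn. The paired results below are made up purely for illustration (they are not real laboratory data), and the variable names are hypothetical; the sketch only shows how the slope b and intercept a of y = bx + a might be obtained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical paired results: the same specimens measured by both methods
# (values are made up for illustration only)
reference_method = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2])
new_method = np.array([2.3, 4.0, 6.0, 8.3, 10.1, 12.0, 14.4, 16.1])

# scikit-learn expects a 2D feature array: one row per specimen, one column per feature
x = reference_method.reshape(-1, 1)

model = LinearRegression()
model.fit(x, new_method)

slope = model.coef_[0]        # b in y = bx + a
intercept = model.intercept_  # a in y = bx + a
r_squared = model.score(x, new_method)
print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r_squared:.3f}")
```

A slope near 1 and an intercept near 0 would suggest that the two methods give similar results over the range tested.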