Machine learning & deep learning

Editor-in-Chief: Debra Zynger, M.D.
Author: Jerome Cheng, M.D.

Topic Completed: 7 May 2020

Minor changes: 26 June 2020

Copyright: 2003-2020,, Inc.

PubMed Search: Machine informatics [title]

Cite this page: Cheng J. Fundamentals. website. Accessed September 26th, 2020.
Definition / general
  • Science of using computer algorithms to learn from patterns present in data and to make predictions based on the learned patterns
Essential features
  • A machine learning algorithm is used to create a model from a dataset from which predictions are made
  • With the proliferation of open source tools, machine learning is now more accessible than before
  • A machine learning model can be built without any knowledge of computer programming; for beginners, Orange software is a good starting point (GitHub: Orange - Interactive data analysis [Accessed 26 March 2020])
  • Applications in pathology / laboratory medicine include molecular subtyping of cancer, image recognition / segmentation and identification of lesions in digital slides, digital slide stain normalization
  • Publicly available datasets applicable to machine learning can be found on the Internet
  • Target / outcome variable (label): value predicted based on the values of other variables in a dataset; it is analogous to the dependent variable in statistics
    • Features that contribute towards making the prediction are equivalent to the independent variables in statistics
    • Terms "feature" and "variable" are used interchangeably
  • Supervised machine learning: discovers relationships between feature variables and the target feature (label); the label has to be provided before the machine learning model can be trained to build a predictive model
  • Unsupervised machine learning: unlike supervised machine learning, the data label is not needed to find patterns in a dataset (e.g. similar sets of data points forming unique clusters in a t-SNE plot or groups formed through k-means clustering)
  • AutoML (automated machine learning): these are tools / software libraries that build, choose and optimize machine learning models with minimal user input; in some cases, all you have to do is provide the dataset and set the target feature and it will start building machine learning models for you
  • Overfitting: the machine learning model "memorizes" the training data and corresponding outcome it was given and performs poorly on data it has never seen before
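Overfitting can be illustrated with a deliberately extreme sketch: a "model" that is nothing more than a lookup table of its training examples. The data and the `MemorizingModel` class below are hypothetical and made up for illustration; real models overfit in subtler ways.

```python
from collections import Counter


class MemorizingModel:
    """Toy model that memorizes training rows verbatim (hypothetical example)."""

    def fit(self, rows, labels):
        self.table = dict(zip(rows, labels))
        # Fall back to the most common training label for rows never seen before
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.table.get(row, self.default) for row in rows]


def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Made-up data: one feature per row; the true label is 1 when the value exceeds 5
train_rows = [(1,), (2,), (3,), (7,), (4,)]
train_labels = [0, 0, 0, 1, 0]
unseen_rows = [(6,), (8,), (9,), (0,)]
unseen_labels = [1, 1, 1, 0]

model = MemorizingModel().fit(train_rows, train_labels)
train_acc = accuracy(model.predict(train_rows), train_labels)
unseen_acc = accuracy(model.predict(unseen_rows), unseen_labels)
print(f"training accuracy: {train_acc:.2f}, unseen accuracy: {unseen_acc:.2f}")
```

The gap between the two accuracies is the warning sign: the model reproduces its training data perfectly but has learned nothing that generalizes.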
Machine learning platforms
Machine learning algorithms
  • Linear regression
    • Given a set of paired x and y values, it finds the best fit line, i.e. the straight line that minimizes the sum of squared vertical distances between the line and the points
    • Used in laboratories to validate a new method for a particular test by comparing the results between the new method and the reference method
    • Various software tools such as R and Python with scikit-learn simplify the process by performing all the necessary calculations
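As a rough sketch of what such tools calculate, the ordinary least squares slope and intercept can be computed by hand in a few lines of Python. The paired results below are made up for illustration and `fit_line` is a hypothetical helper, not a function from R or scikit-learn.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up paired results: reference method (x) versus new method (y)
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
new_method = [1.1, 2.1, 2.9, 4.2, 5.0]

slope, intercept = fit_line(reference, new_method)
print(f"new = {slope:.3f} * reference + {intercept:.3f}")
```

In a method comparison, a slope near 1 and an intercept near 0 suggest that the two methods agree across the measured range.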
  • Logistic regression
    • Statistical method for solving classification problems
    • An equation based on the sigmoid function models the relationship between the target variable and one or more feature variables
    • Predicts the probability of an outcome occurring
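A minimal sketch of how a fitted logistic regression model turns feature values into a probability is shown below. The weights and bias are hypothetical placeholders; in practice they are learned from the training data.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive outcome for one record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients (assumed for illustration, not fitted to real data)
weights = [0.8, -0.5]
bias = -1.0

p = predict_probability([2.0, 1.0], weights, bias)
predicted_class = 1 if p >= 0.5 else 0  # 0.5 is a common default cutoff
print(f"probability = {p:.3f}, predicted class = {predicted_class}")
```

The cutoff that converts the probability into a class is a separate choice and can be moved to trade sensitivity against specificity.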
  • Naïve Bayes
    • Based on Bayes' theorem, with the assumption that the variables involved contribute independently to the outcome variable; this assumption may be wrong, hence the description "naïve"
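Written out, with y as the outcome and x_1 to x_n as the feature values (notation introduced here for illustration), the independence assumption lets the probability of each outcome be scored as a simple product:

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The predicted class is the value of y with the largest score.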
  • Decision trees
    • An optimal decision tree is constructed to fit the dataset
    • Each node in the tree consists of a feature variable and each node splits into branches based on the value of the variable
    • End of a branch is the value predicted for the target variable
    • Outside of machine learning, decision trees are often used in diagnosis and treatment guidelines
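The node and branch structure can be sketched as a nested dictionary. The tree below is hand built with hypothetical feature names and thresholds; a learning algorithm would choose the splits from the data.

```python
# Hypothetical hand-built tree (feature names and thresholds are made up)
tree = {
    "feature": "cell_size",
    "threshold": 3.0,
    "left": {"prediction": "benign"},  # cell_size <= 3.0
    "right": {  # cell_size > 3.0
        "feature": "mitoses",
        "threshold": 2.0,
        "left": {"prediction": "benign"},
        "right": {"prediction": "malignant"},
    },
}


def predict(node, record):
    """Follow branches from the root until the end of a branch (a leaf) is reached."""
    while "prediction" not in node:
        branch = "left" if record[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]


print(predict(tree, {"cell_size": 5.0, "mitoses": 3.0}))
```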
  • Random forest
    • Popular and versatile machine learning method
    • Results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables
  • Gradient boosted trees
    • Like random forests, it involves multiple decision trees; here the trees are built sequentially, each one correcting the errors of those before it
    • Applied to nonimage (tabular) datasets, it often achieves better predictive accuracy than random forests and other machine learning methods
  • Support vector machines
    • A line, plane or hyperplane divides the points in a dataset into classes; the boundary is chosen to maximize the margin between it and the nearest points of each class
  • K-means clustering
    • Unsupervised machine learning algorithm that groups similar sets of data together
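A compact sketch of the standard k-means procedure (Lloyd's algorithm) follows. The points and the starting centroids are made up; real implementations usually pick the starting centroids at random and run several restarts.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and moving the centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters


# Made-up 2D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

No labels are supplied at any point; the groups emerge from the distances between points alone.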
  • Dimensionality reduction methods
    • E.g. PCA (principal component analysis), t-SNE, autoencoders
    • Reduces number of feature dimensions for easier interpretation or visualization
  • Neural networks
    • Inspired by interconnections between neurons in biological neural networks
  • Convolutional neural network
    • A type of neural network often used for image classification (e.g. cancer subtype or benign versus malignant)
    • Also used for semantic segmentation, object detection with classification, generation of fake images and style transfer (generative adversarial network), natural language processing
  • Generative adversarial network
    • Involves 2 neural networks
      • Generator: creates fake data
      • Discriminator: distinguishes real from fake data
    • Goal of training is for the generator to become better at creating images that look real to the discriminator
  • Autoencoders
    • Type of neural network that transforms an input into an intermediate representation, from which the original input is recreated
  • Natural language processing
    • Text may be converted to a numerical matrix through word embeddings or bag-of-words representations; these representations may then be passed to other machine learning algorithms to make predictions from textual data
    • Word embedding techniques
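The bag-of-words conversion mentioned above can be sketched in a few lines. The report snippets are made up, and `bag_of_words` is a hypothetical helper that splits on whitespace only; real pipelines also handle punctuation, stop words and word forms.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Made-up snippets of report text
reports = ["no tumor seen", "tumor present tumor invasive"]
vocabulary, matrix = bag_of_words(reports)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix can then serve as the feature vector for one document in any of the algorithms listed above.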
Steps in building a machine learning model
  • Data collection and preparation
    • Usually the most time consuming process in building models
    • Data can come from a variety of sources including questionnaires, internet searches, databases and images
    • CSV (comma separated values) and other spreadsheet style file formats are commonly used for training machine learning models; each column represents a feature and each row represents a record
  • Choose a programming language or machine learning platform
    • Python and R are currently the most popular programming languages used in machine learning; the scikit-learn library is essential to performing machine learning tasks in Python; the easiest way to install Python along with the scikit-learn package is by installing Anaconda (Anaconda: Downloads - Anaconda [Accessed 26 March 2020])
    • Orange and Knime are open source GUI (graphical user interface) based machine learning platforms; these are easy to use and require no programming experience
  • Choose a machine learning algorithm
  • Set the hyperparameters for the model
  • Split the data into training, validation and holdout sets
    • Validation set
      • During training, the model is tested on the validation set to assess its performance (e.g. accuracy or AUC ROC); hyperparameters of the model may be altered during the training process to improve the performance of the model
        • A training accuracy much higher than the validation accuracy is a sign of overfitting
      • Also referred to as the development set
    • Holdout dataset
      • Used to assess the performance of the final model on unseen data after it has been fully trained; unlike the validation set, it is never exposed to the model during training or tuning and should give a better measure of performance
    • For smaller datasets, the holdout set may be omitted and model accuracy can be assessed using cross validation methods
    • Commonly used ratios for training, validation and holdout sets
      • 70/15/15
      • 60/20/20
      • Very large datasets can have a larger proportion of the data in the training set e.g. 80/10/10 or 90/5/5
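A 70/15/15 split can be sketched with the standard library alone. `split_dataset` is a hypothetical helper and the list of integers stands in for the rows of a real dataset; the fixed seed is only there to make the shuffle repeatable.

```python
import random


def split_dataset(records, train=0.70, validation=0.15, seed=0):
    """Shuffle records and split them into training, validation and holdout lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the holdout set
    )


records = list(range(100))  # stand-in for 100 rows of a dataset
train_set, validation_set, holdout_set = split_dataset(records)
print(len(train_set), len(validation_set), len(holdout_set))
```

Shuffling before splitting matters when the rows are stored in a meaningful order (e.g. by date or by diagnosis), since otherwise the three sets may not resemble each other.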
  • Train and test the model
    • Test the performance of the model on the validation set
    • Commonly used metrics:
      • Accuracy
      • Log Loss
      • AUC (Area Under ROC curve)
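For a binary outcome, the three metrics above can be sketched as follows. The labels and predicted probabilities are made up, and these helper functions are simplified illustrations rather than the implementations found in any particular library.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, probabilities, eps=1e-15):
    """Average negative log-likelihood; confident wrong predictions cost the most."""
    total = 0.0
    for t, p in zip(y_true, probabilities):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def auc(y_true, probabilities):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    positives = [p for t, p in zip(y_true, probabilities) if t == 1]
    negatives = [p for t, p in zip(y_true, probabilities) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if p >= 0.5 else 0 for p in probabilities]

print(accuracy(y_true, y_pred), log_loss(y_true, probabilities), auc(y_true, probabilities))
```

Accuracy depends on the chosen cutoff (0.5 here), whereas log loss and AUC are computed from the predicted probabilities directly.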
    • For smaller sample sizes:
      • K-fold cross validation: entire dataset is subdivided into K subsets; each subset acts as the validation / test set once and the rest of the data are used for training the model; model performance is assessed by averaging the results attained from each subset
      • Leave-one-out cross-validation: model is trained using the entire dataset except for one data point and the model is validated with the data point that was left out; process is repeated until every data point has been used as the test data point
      • Random stratification: set percentage of the dataset is randomly assigned to the training set and the remainder becomes the test set
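K-fold and leave-one-out cross validation differ only in the number of folds, which the sketch below tries to make concrete. `k_fold_indices` and `cross_validate` are hypothetical helpers, and `evaluate` stands for any user-supplied function that trains a model on the training indices and returns a score on the validation indices. Rows are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, validation) over all folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# 3-fold split of 10 records; setting k equal to n_samples gives leave-one-out
for train, validation in k_fold_indices(10, 3):
    print(validation)
```

Every record is used for validation exactly once, which is why the averaged score makes fuller use of a small dataset than a single split does.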
  • Optimize the model
    • Fine tune the model parameters
    • Add new data to the dataset
    • Change the features in the dataset
    • Add new features
    • Image augmentation (for convolutional neural networks)
  • Test the performance of the final model on the holdout set
    • Gives a better measure of real world model performance
Making your first model
  • Creating machine learning models with GUI based platforms like Orange is quick and easy
  • Download and install Orange through the following link: (GitHub: Orange - Interactive data analysis [Accessed 26 March 2020])
  • Launch the program and a "Welcome" screen will appear
  • Select the "New" option
  • In Orange, various tasks are done through "Widgets", which are represented by various icons on the left side of the user interface
  • Click on the "Datasets" widget found on the left side of the screen and drag it into the canvas (the empty portion of the screen on the right side); double click on it and choose a dataset from the list that appears; click on the "Send data" button
  • Add the "Test & Score" widget to the canvas
  • Add the "Random Forest" widget to the canvas
  • Another machine learning algorithm instead of "Random Forest" can be chosen
  • Connect the "Datasets" widget to the "Test & Score" widget by clicking on the edge of one widget and dragging the line that appears to the other
  • Connect the "Random Forest" widget to "Test & Score"
  • Canvas should now show the "Datasets" and "Random Forest" widgets each connected to "Test & Score"
  • Double click on "Test & Score" and the performance of the newly built model will be displayed
Board review style question #1
    Which machine learning algorithm predicts an outcome by combining results from multiple decision trees?

  A. Linear regression
  B. Neural networks
  C. Random forest
  D. Support vector machines
Board review answer #1
C. In random forest, results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables.

Reference: Machine learning & deep learning - fundamentals

Board review style question #2
    Which of the following looks for the best fitting straight line that goes through a set of X and Y points?

  A. Convolutional neural networks
  B. Linear regression
  C. Neural networks
  D. Random forest
Board review answer #2
B. In linear regression, an entire set of X and Y points is used to arrive at the linear equation y = bx + a, where b is the slope and a is the y-intercept. Linear regression can be used for validation of laboratory test results.

Reference: Machine learning & deep learning - fundamentals
