Machine learning fundamentals

Editor-in-Chief: Debra L. Zynger, M.D.
Jerome Cheng, M.D.

Topic Completed: 3 January 2021

Minor changes: 3 January 2021

Copyright: 2003-2021,, Inc.


Cite this page: Cheng J. Machine learning fundamentals. website. Accessed January 22nd, 2021.
Definition / general
  • Science of using computer algorithms to learn from patterns present in data and to make predictions based on the learned patterns
Essential features
  • A machine learning algorithm builds a model from a dataset; the model is then used to make predictions
  • With the proliferation of open source tools, machine learning is now more accessible than before
  • A machine learning model can be built without any knowledge of computer programming; for beginners, Orange software is a good starting point (GitHub: Orange - Interactive data analysis [Accessed 26 March 2020])
  • Applications in pathology / laboratory medicine include molecular subtyping of cancer, image recognition / segmentation and identification of lesions in digital slides, digital slide stain normalization
  • Publicly available datasets applicable to machine learning can be found on the Internet
  • AutoML (automated machine learning): tools / software libraries that build, choose and optimize machine learning models with minimal user input; in some cases, the user only needs to provide the dataset and set the target feature
  • Deep learning: refers to neural networks with many layers; many convolutional neural networks fall under that category due to the large number of layers (e.g. convolution or pooling layers). Other types of neural networks with multiple layers (such as artificial neural networks with several hidden layers) also fall under this classification
  • Overfitting: the machine learning model "memorizes" the training data and corresponding outcome it was given and performs poorly on data it has never seen before
  • Supervised machine learning: discovers relationships between feature variables and the target feature (label); the label has to be provided before the machine learning model can be trained to build a predictive model
  • Target / outcome variable (label): value predicted based on the values of other variables in a dataset; it is analogous to the dependent variable in statistics
    • Features that contribute towards making the prediction are equivalent to the independent variables in statistics
    • Terms "feature" and "variable" are used interchangeably
  • Unsupervised machine learning: unlike supervised machine learning, the data label is not needed to find patterns in a dataset (e.g. similar sets of data points forming unique clusters in a t-SNE plot or groups formed through k-means clustering)
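The difference between supervised and unsupervised learning described above can be sketched in a few lines of Python. This is a minimal illustration, assuming scikit-learn is installed; the Iris dataset and the choice of logistic regression and k-means are illustrative assumptions, not part of the original text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris is a small public dataset bundled with scikit-learn:
# 150 flowers, 4 numeric features, 3 species (the label).
X, y = load_iris(return_X_y=True)

# Supervised: the label y must be supplied when the model is trained.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, y)
predicted_species = classifier.predict(X[:5])

# Unsupervised: only the features are supplied; k-means groups
# similar rows into clusters without ever seeing y.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print("Predicted species for the first 5 rows:", predicted_species)
print("Cluster assigned to the first 5 rows:  ", cluster_ids[:5])
```

The cluster numbers returned by k-means are arbitrary identifiers; they do not correspond to the species codes unless they are matched up afterwards.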
Machine learning platforms
Machine learning algorithms
  • Linear regression
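    • Given a set of paired x and y values, it finds the best fit line, i.e. the straight line that minimizes the sum of squared vertical distances between the line and the points; the line generally does not pass through every point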
    • Used in laboratories to validate a new method for a particular test by comparing the results between the new method and the reference method
    • Various software tools such as R and Python with scikit-learn simplify the process by performing all the necessary calculations
  • Logistic regression
    • Statistical method for solving classification problems
    • An equation based on the sigmoid function models the relationship between the target variable and one or more feature variables
    • Predicts the probability of an outcome occurring
  • Naïve Bayes
    • Based on Bayes' theorem, with the assumption that the variables involved contribute independently to the outcome variable; this assumption may be wrong, hence the description "naïve"
  • Decision trees
    • An optimal decision tree is constructed to fit the dataset
    • Each node in the tree consists of a feature variable and each node splits into branches based on the value of the variable
    • End of a branch is the value predicted for the target variable
    • Outside of machine learning, decision trees are often used in diagnosis and treatment guidelines
  • Random forest
    • Popular and versatile machine learning method
    • Results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables
  • Gradient boosted trees
    • Like random forests, it combines multiple decision trees; unlike random forests, the trees are built sequentially, with each new tree correcting errors made by the previous ones
    • Applied to nonimage (tabular) datasets, it often achieves better predictive accuracy than random forests and other machine learning methods
  • Support vector machines
    • A line, plane or hyperplane is found that divides the points in a dataset into classes
  • K-means clustering
    • Unsupervised machine learning algorithm that groups similar sets of data together
  • Dimensionality reduction methods
    • E.g. PCA (principal component analysis), t-SNE (t-distributed stochastic neighbor embedding), autoencoders
    • Reduces number of feature dimensions for easier interpretation or visualization
  • Neural networks
    • Inspired by interconnections between neurons in biological neural networks
  • Convolutional neural network
    • A type of neural network often used for image classification (e.g. cancer subtype or benign versus malignant)
    • Also used for semantic segmentation, object detection with classification, generation of fake images and style transfer (generative adversarial network) and natural language processing
  • Generative adversarial network
    • Involves 2 neural networks
      • Generator: creates fake data
      • Discriminator: distinguishes real from fake data
    • Goal of training is for the generator to become better at creating images that look real to the discriminator
  • Autoencoders
    • Type of neural network that transforms an input into an intermediate representation, from which the original input is recreated
  • Natural language processing
    • Text may be converted to a numerical matrix through word embeddings or bag-of-words representations; these representations may be combined with other machine learning algorithms to make predictions from textual data
    • Word embedding techniques
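Several of the classification algorithms listed above can be tried on the same data with only a few lines each. The sketch below assumes scikit-learn is installed and uses its bundled breast cancer dataset (benign versus malignant) purely as an illustrative stand-in; the 70/30 split, default hyperparameters and feature scaling are assumptions, and the resulting accuracies say nothing general about which algorithm is best.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# 569 tumors, 30 numeric features, label = benign or malignant
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Logistic regression and SVM are sensitive to feature scale,
# so they are wrapped in a pipeline that standardizes the features first.
models = {
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient boosted trees": GradientBoostingClassifier(random_state=0),
    "Support vector machine": make_pipeline(StandardScaler(), SVC()),
}

# Train each model on the same training data and score it on the same test data
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name:24s} accuracy = {test_accuracy[name]:.3f}")
```

Because every scikit-learn estimator exposes the same fit / predict / score methods, swapping one algorithm for another usually means changing a single line.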
Steps in building a machine learning model
  • Data collection and preparation
    • Usually the most time consuming process in building models
    • Data can come from a variety of sources including questionnaires, internet searches, databases and images
    • CSV (comma separated values) and other spreadsheet style file formats are commonly used for training machine learning models; each column represents a feature and each row represents a record
  • Choose a programming language or machine learning platform
    • Python and R are currently the most popular programming languages used in machine learning; the scikit-learn library is essential to performing machine learning tasks in Python; the easiest way to install Python along with the scikit-learn machine learning package is by installing Anaconda (Anaconda: Downloads - Anaconda [Accessed 26 March 2020])
    • Orange and Knime are open source GUI (graphical user interface) based machine learning platforms; these are easy to use and require no programming experience
  • Choose a machine learning algorithm
  • Set the hyperparameters for the model
  • Split the data into training, validation and holdout sets
    • Validation set
      • During the process of training, the model is tested on the validation set to assess its performance (e.g. accuracy or AUC ROC); hyperparameters of the model may be altered during the training process to improve the performance of the model
        • A training accuracy much higher than the validation accuracy is a sign of overfitting
      • Also referred to as the development set
    • Holdout dataset
      • Used to assess the performance of the final model on unseen data after it has been fully trained; unlike the validation set, it is never used during training or hyperparameter tuning and should therefore give a better measure of performance
    • For smaller datasets, the holdout set may be omitted and model accuracy can be assessed using cross validation methods
    • Commonly used ratios for training, validation and holdout sets
      • 70/15/15
      • 60/20/20
      • Very large datasets can have a larger proportion of the data in the training set e.g. 80/10/10 or 90/5/5
  • Train and test the model
    • Test the performance of the model on the validation set
    • Commonly used metrics:
      • Accuracy
      • Log loss
      • AUC (area under the ROC curve)
    • For smaller sample sizes:
      • K-fold cross validation: entire dataset is subdivided into K subsets; each subset acts as the validation / test set once and the rest of the data are used for training the model; model performance is assessed by averaging the results attained from each subset
      • Leave-one-out cross-validation: model is trained using the entire dataset except for one data point and the model is validated with the data point that was left out; process is repeated until every data point has been used as the test data point
      • Random stratification: a set percentage of the dataset is randomly assigned to the training set and the remainder becomes the test set
  • Optimize the model
    • Fine tune the model parameters
    • Add new data to the dataset
    • Change the features in the dataset
    • Add additional features
    • Image augmentation (for convolutional neural networks)
  • Test the performance of the final model on the holdout set
    • Gives a better measure of real world model performance
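The steps above can be sketched end to end in Python. This is a minimal sketch, assuming scikit-learn is installed; the bundled breast cancer dataset, the random forest algorithm and the choice of tree depth as the single hyperparameter to tune are all illustrative assumptions rather than recommendations from the original text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 70/15/15 split: set aside 30% of the data, then halve that portion
# into a validation set and a holdout set.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
X_val, X_hold, y_val, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0
)

# Try several values of one hyperparameter; keep the one that does best
# on the validation set. A training accuracy far above the validation
# accuracy would suggest overfitting.
best_depth, best_val_acc = None, -1.0
for depth in (1, 2, 4, 8):
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    print(f"max_depth={depth}: training {train_acc:.3f}, validation {val_acc:.3f}")
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Final model: evaluated once on the holdout set, which played no part
# in training or in choosing the hyperparameter.
final_model = RandomForestClassifier(
    n_estimators=100, max_depth=best_depth, random_state=0
)
final_model.fit(X_train, y_train)
hold_pred = final_model.predict(X_hold)
hold_proba = final_model.predict_proba(X_hold)  # one column per class

holdout_metrics = {
    "accuracy": accuracy_score(y_hold, hold_pred),
    "log_loss": log_loss(y_hold, hold_proba),
    "auc": roc_auc_score(y_hold, hold_proba[:, 1]),
}
for metric, value in holdout_metrics.items():
    print(f"holdout {metric}: {value:.3f}")
```

Splitting with `stratify` keeps the proportion of benign and malignant cases similar across the three sets, which matters more as the sets get smaller.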
Making your first model
  • Creating machine learning models with GUI based platforms like Orange is quick and easy
  • Download and install Orange through the following link: (GitHub: Orange - Interactive data analysis [Accessed 26 March 2020])
  • Launch the program and a "Welcome" screen will appear
  • Select the "New" option
  • In Orange, various tasks are done through "Widgets", which are represented by various icons on the left side of the user interface
  • Click on the "Datasets" widget found on the left side of the screen and drag it into the canvas (the empty portion of the screen on the right side); double click on it and choose a dataset from the list that appears; click on the "Send data" button
  • Add the "Test & Score" widget to the canvas
  • Add the "Random Forest" widget to the canvas
  • A different machine learning algorithm widget can be used in place of "Random Forest"
  • Connect the "Test & Score" widget to the "Datasets" widget by clicking on it and connecting the line that appears to the "Datasets" widget
  • Connect the "Random Forest" widget to "Test & Score"
  • Canvas should look something like:

  • Double click on "Test & Score" and the performance of the newly built model will be displayed
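For readers who prefer code, a rough scripted analogue of this workflow can be written with scikit-learn. This is not Orange's own API: the Iris dataset stands in for whichever dataset was selected in the "Datasets" widget, and 5-fold cross validation is an assumed evaluation setting, not necessarily the one Orange applies.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the "Datasets" widget: load a small bundled dataset
X, y = load_iris(return_X_y=True)

# Stand-in for the "Random Forest" widget: the learning algorithm
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Stand-in for the "Test & Score" widget: K-fold cross validation
# (K = 5 here), reporting the accuracy obtained on each fold
fold_accuracy = cross_val_score(forest, X, y, cv=5, scoring="accuracy")

print("Accuracy per fold:", fold_accuracy.round(3))
print("Mean accuracy:    ", round(fold_accuracy.mean(), 3))
```

Replacing `RandomForestClassifier` with another scikit-learn estimator mirrors swapping the "Random Forest" widget for a different algorithm on the canvas.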
Board review style question #1
    Which machine learning algorithm predicts an outcome by combining results from multiple decision trees?

  A. Linear regression
  B. Neural networks
  C. Random forest
  D. Support vector machines
Board review style answer #1
C. In random forest, results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables.

Reference: Machine learning & deep learning - fundamentals

Board review style question #2
    Which of the following looks for the best fitting straight line that goes through a set of X and Y points?

  A. Convolutional neural networks
  B. Linear regression
  C. Neural networks
  D. Random forest
Board review style answer #2
B. In linear regression, an entire set of X and Y points is used to arrive at the linear equation y = bx + a, where b is the slope and a is the y-intercept (a constant). Linear regression can be used for validation of laboratory test results.

Reference: Machine learning & deep learning - fundamentals
