**Cite this page:** Cheng J. Machine learning & deep learning - fundamentals. PathologyOutlines.com website. http://www.pathologyoutlines.com/topic/informaticsmachinelearn.html. Accessed January 24th, 2020.

Definition / general

- Science of using computer algorithms to learn from patterns present in a dataset and making predictions based on the learned patterns

Essential features

- A machine learning algorithm is used to create a model from a dataset from which predictions are made
- With the proliferation of open source tools, machine learning is now more accessible than before
- A machine learning model can be built without any knowledge of computer programming; for beginners, Orange software is a good starting point (Github.com: Orange - Interactive data analysis [Accessed 7 May 2018])
- Applications in pathology / laboratory medicine include molecular subtyping of cancer, image recognition / segmentation and identification of lesions in digital slides
- Publicly available datasets applicable to machine learning can be found on the Internet

Terminology

- Target / outcome variable is the value predicted based on the values of other variables in a dataset; it is analogous to the dependent variable in statistics
- Features that contribute towards making the prediction are equivalent to the independent variables in statistics
- Terms "feature" and "variable" are used interchangeably

- Supervised machine learning: relationships between the variables and the target feature are learned from a dataset in which the target is known; predictions are then made on a second dataset in which the target feature is unknown while the other variables are provided
- Unsupervised machine learning: no target feature is provided; data are assigned to different classes based on the patterns and relationships found by the algorithm

Machine learning platforms

Data repositories

Machine learning algorithms

- Linear regression
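- Given a set of paired x and y points, it finds the straight line that best fits the points (typically by minimizing the sum of squared vertical distances between the points and the line); the line does not need to pass through every point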
- Used in laboratories to validate a new method for a particular test by comparing the results between the new method and the reference method
- Various software tools such as R and Python with scikit-learn simplify the process by performing all the necessary calculations
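
As a minimal sketch of what these tools calculate, the ordinary least squares slope and intercept can be worked out with the Python standard library alone. The function name `fit_line` and the paired results below are hypothetical and chosen only for illustration; in a method comparison, x would hold the reference method results and y the new method results.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = b*x + a; returns (slope b, intercept a)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross products and sum of squares around the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical paired results: reference method (x) vs. new method (y)
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
new_method = [1.1, 2.1, 2.9, 4.2, 5.0]

slope, intercept = fit_line(reference, new_method)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A slope close to 1 and an intercept close to 0 suggest that the two methods agree over the range tested.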

- Logistic regression
- Statistical method for solving classification problems
- An equation demonstrates the relationship between the target variable and one or more variables
- Predicts the probability of an outcome occurring
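
The equation behind logistic regression can be illustrated with a short sketch. It assumes a model that has already been fitted; the coefficient and intercept values are hypothetical, as is the 0.5 cutoff used to turn the probability into a class.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(coef_i * x_i))))."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical, already fitted coefficients for two features
coefficients = [0.8, -0.5]
intercept = -1.0

p = predict_probability([2.0, 1.0], coefficients, intercept)
# Assumed cutoff of 0.5 to convert the probability into a class label
label = "positive" if p >= 0.5 else "negative"
print(f"probability = {p:.3f}, predicted class = {label}")
```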

- Naïve Bayes
- Based on Bayes' theorem, with the assumption that the variables involved contribute independently to the outcome variable; this assumption may be wrong, hence the description "naïve"
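
The independence assumption can be made concrete with a small sketch: the prior probability of each class is multiplied by the probability of each observed feature value given that class, and the results are normalized. All probabilities and names below are hypothetical and for illustration only.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Return P(class | observed), treating features as independent given the class."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature, value in observed.items():
            # "Naive" step: multiply per-feature probabilities as if independent
            score *= likelihoods[cls][feature][value]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


# Hypothetical probabilities, for illustration only
priors = {"disease": 0.1, "healthy": 0.9}
likelihoods = {
    "disease": {"test_a": {"pos": 0.9, "neg": 0.1}, "test_b": {"pos": 0.8, "neg": 0.2}},
    "healthy": {"test_a": {"pos": 0.2, "neg": 0.8}, "test_b": {"pos": 0.1, "neg": 0.9}},
}

posterior = naive_bayes_posterior(priors, likelihoods, {"test_a": "pos", "test_b": "pos"})
print(posterior)
```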

- Decision trees
- An optimal decision tree is constructed to fit the dataset
- Each node in the tree consists of a feature variable and each node splits into branches based on the value of the variable
- End of a branch is the value predicted for the target variable
- Outside of machine learning, decision trees are often used in diagnosis and treatment guidelines
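
A fitted decision tree can be represented as nested nodes and followed from the root to the end of a branch. The tree below is a hypothetical, hand-written example (the feature names and thresholds are assumptions and carry no clinical meaning); it shows only how a prediction is read from a tree, not how the tree is learned from data.

```python
# Hypothetical tree: internal nodes test one feature; leaves hold the predicted value
tree = {
    "feature": "cell_size",
    "threshold": 3.0,
    "left": {"prediction": "benign"},  # cell_size <= 3.0
    "right": {  # cell_size > 3.0
        "feature": "mitoses",
        "threshold": 2.0,
        "left": {"prediction": "benign"},
        "right": {"prediction": "malignant"},
    },
}


def predict(node, record):
    """Follow branches by feature value until the end of a branch (a leaf) is reached."""
    while "prediction" not in node:
        branch = "left" if record[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]


print(predict(tree, {"cell_size": 5.0, "mitoses": 4.0}))
```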

- Random forest
- Popular and versatile machine learning method
- Results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables
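
The pooling idea can be sketched with deliberately simplified trees. In this sketch each "tree" is a single split (a stump) on one randomly chosen feature, trained on a random sample of the records drawn with replacement, and the final prediction is a majority vote. The function names and the data are hypothetical; real random forest implementations grow full trees and differ in many details.

```python
import random
from collections import Counter


def majority(labels, fallback):
    """Most common label in a list, or the fallback if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else fallback


def train_stump(rows, labels, feature):
    """One-split tree: threshold at the feature mean; each side predicts its majority label."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    overall = majority(labels, None)
    left = [lab for r, lab in zip(rows, labels) if r[feature] <= threshold]
    right = [lab for r, lab in zip(rows, labels) if r[feature] > threshold]
    return feature, threshold, majority(left, overall), majority(right, overall)


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        # Random subset of the input dataset (sampled with replacement)
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random subset of the feature variables (here, one feature per stump)
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Pool the individual predictions by majority vote."""
    return Counter(predict_stump(s, row) for s in forest).most_common(1)[0][0]


# Hypothetical two-feature dataset with two well separated classes
rows = [[1.0, 1.2], [1.5, 0.8], [2.0, 1.0], [6.0, 5.5], [6.5, 6.2], [7.0, 5.8]]
labels = ["A", "A", "A", "B", "B", "B"]

forest = train_forest(rows, labels, n_trees=25, rng=random.Random(0))
print(predict_forest(forest, [1.2, 1.0]), predict_forest(forest, [6.8, 6.0]))
```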

- Support vector machines
- A line, plane or hyperplane divides the points in a dataset into classes
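
Once a separating boundary is known, classifying a point amounts to checking on which side of it the point falls. The sketch below assumes a hypothetical, already determined line in two dimensions; the weights, the bias and the class names are illustrative, and the training step that finds the boundary is not shown.

```python
def decision_value(point, weights, bias):
    """Signed score w.x + b; its sign indicates the side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(point, weights, bias):
    return "class 1" if decision_value(point, weights, bias) >= 0 else "class 2"


# Hypothetical separating line in two dimensions: x1 + x2 - 5 = 0
weights, bias = [1.0, 1.0], -5.0

print(classify([4.0, 3.0], weights, bias))
print(classify([1.0, 2.0], weights, bias))
```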

- Neural networks
- Inspired by interconnections between neurons in biological neural networks

- Convolutional neural network
- A type of neural network used in image classification and natural language processing
- Can be used for classifying histology images into different subtypes (e.g. cancer subtype or benign vs. malignant)

Steps in building a machine learning model

- Collection of data
- Data can come from a variety of sources including questionnaires, internet searches, databases and images

- Preparation of data
- CSV (comma separated values) file format, as well as other spreadsheet style formats are commonly used for training machine learning models; each feature is designated by a column and each row represents a record
- Downloading a dataset from a machine learning data repository (e.g. UCI Machine Learning Repository) can replace the first 2 steps
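
The column / row layout described above can be illustrated by reading a small CSV table with the Python standard library and separating the feature columns from the target column. The column names and values are hypothetical, and the text is held in a string here only to keep the sketch self-contained; in practice it would be read from a file.

```python
import csv
import io

# Hypothetical CSV content: one column per feature, one row per record
csv_text = """age,marker_level,diagnosis
54,1.2,benign
61,3.8,malignant
47,0.9,benign
"""

reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

# Assumption: the last column, "diagnosis", is the target variable
target_name = "diagnosis"
feature_names = [name for name in reader.fieldnames if name != target_name]

X = [[float(row[name]) for name in feature_names] for row in records]
y = [row[target_name] for row in records]

print(feature_names, X, y)
```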

- Choose a programming language or machine learning platform
- Python and R are currently the most popular programming languages used in machine learning; the scikit-learn library is essential to performing machine learning tasks in Python; the easiest way to install Python along with the scikit-learn package is by installing Anaconda (anaconda.com: Downloads - Anaconda [Accessed 2 May 2018])
- Orange and KNIME are open source GUI (graphical user interface) based machine learning platforms; these are easy to use and require no programming experience

- Choose a machine learning algorithm
- Set the parameters for the model
- Train the model
- Test the performance of the model
- K-fold cross validation: entire dataset is subdivided into K subsets; each subset acts as the validation / test set once and the rest of the data are used for training the model; model performance is assessed by averaging the results attained from each subset
- Leave-one-out cross-validation: model is trained using the entire dataset except for one data point and the model is validated with the data point that was left out; process is repeated until every data point has been used as the test data point
- Random stratification (random train / test split): a set percentage of the dataset is randomly assigned to the training set and the remainder becomes the test set
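
The three testing schemes differ only in how record indices are divided between training and testing, which the sketch below illustrates. The function names are hypothetical, and records are assumed to be in random order before the K-fold split. For each split, a model would be trained on the training indices and scored on the test indices, and the scores would then be averaged.

```python
import random


def k_fold_splits(n_records, k):
    """Yield (train_indices, test_indices); each record is in the test set exactly once."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_records):
    """Leave-one-out is K-fold with K equal to the number of records."""
    return k_fold_splits(n_records, n_records)


def random_split(n_records, train_fraction, rng):
    """Randomly assign a set fraction of records to training; the rest form the test set."""
    indices = list(range(n_records))
    rng.shuffle(indices)
    cut = int(round(n_records * train_fraction))
    return sorted(indices[:cut]), sorted(indices[cut:])


folds = list(k_fold_splits(10, 5))
loo = list(leave_one_out_splits(4))
train, test = random_split(10, 0.7, random.Random(0))

print(folds[0])
print(loo[0])
print(len(train), len(test))
```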

- Optimize the model
- Fine tune the model parameters
- Add new data to the dataset
- Change the features in the dataset
- Add additional features

Making your first model

- Creating machine learning models with GUI based platforms like Orange is quick and easy
- Download and install Orange through the following link: (Github.com: Orange - Interactive data analysis [Accessed 7 May 2018])
- Launch the program and a "Welcome" screen will appear
- Select the "New" option
- In Orange, various tasks are done through "Widgets", which are represented by various icons on the left side of the user interface
- Click on the "Datasets" widget found on the left side of the screen and drag it onto the canvas (the empty portion of the screen on the right side); double click on it and choose a dataset from the list that appears; click on the "Send Data" button
- Add the "Test & Score" widget to the canvas
- Add the "Random Forest" widget to the canvas
- Another machine learning algorithm instead of "Random Forest" can be chosen
- Connect the "Datasets" widget to the "Test & Score" widget by clicking on the edge of one widget and dragging the line that appears to the other
- Connect the "Random Forest" widget to "Test & Score"
- Canvas should now show the "Datasets" and "Random Forest" widgets, each connected to "Test & Score"
- Double click on "Test & Score" and the performance of the newly built model will be displayed

Clinical applications

- Applications of machine learning to medicine are vast and include:
- Risk stratification (Infect Control Hosp Epidemiol 2018;39:425)
- Predicting drug response (Clin Gastroenterol Hepatol 2010;8:143)
- Cancer metastases detection in lymph nodes (JAMA 2017;318:2199)
- Molecular subtyping of cancer (Brief Bioinform 2018 Apr 12 [Epub ahead of print])
- Prediction of susceptibility to Vibrio cholerae infection (J Infect Dis 2018 Apr 12 [Epub ahead of print])
- Blood pressure estimation using ECG signals (Sensors (Basel) 2018;18:E1160)

Additional references

Board review question #1

- Which machine learning algorithm predicts an outcome by combining results from multiple decision trees?
- A. Linear regression
- B. Neural networks
- C. Random forest
- D. Support vector machines

Board review answer #1

**C**. In random forest, results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables.

Board review question #2

- Which of the following looks for the best fitting straight line through a set of X and Y points?
- A. Convolutional neural networks
- B. Linear regression
- C. Neural networks
- D. Random forest

Board review answer #2

**B**. In linear regression, an entire set of X and Y points is used to arrive at the linear equation y = bx + a, where b is the slope and a is the y-intercept. Linear regression can be used for validation of laboratory test results.
