Natural language processing (NLP)

Editorial Board Member: Lewis A. Hassell, M.D.
David Nai, M.D.
Jerome Cheng, M.D.

Last author update: 16 February 2023
Last staff update: 1 March 2023

Copyright: 2022-2023,, Inc.

Cite this page: Nai DW, Cheng J. Natural language processing (NLP). website. Accessed October 2nd, 2023.
Definition / general
  • Natural language processing (NLP): research and application that explores how computers can be used to understand and manipulate natural language text or speech to do useful things
  • Although AI has addressed the problem of human language since as early as the 1950s, progress has come in very small, hard-fought increments, each following the introduction of new methods
  • Programs are finally being made to understand English and other natural languages written in free text, with rich potential application in extracting and understanding pathology information (Multimed Tools Appl 2023;82:3713)
Essential features
  • NLP algorithms can automatically extract named entities from pathology reports
  • Word embeddings (interpreting words as vectors in an n-dimensional space) in combination with neural networks can categorize text into different categories
  • Suggested categories of applications of NLP to pathology (Am J Pathol 2022 Aug 17 [Epub ahead of print]):
    • Extraction of information from pathology reports
    • Summarization of pathology reports
    • Machine translation
    • Topic modeling (grouping keywords according to an intelligent scheme)
    • Workflow optimization / prescreening
    • Report generation
  • Free text versus discrete data:
    • Free text data occurs in an information system when text that is not already structured into form fields has to be processed to retrieve relevant information (e.g., the excision shows lepidic predominant adenocarcinoma of the lung, 1.4 cm, T1 N0, with negative resection margins)
    • Discrete data is organized into fields, making it much easier for a program to extract the appropriate data
    • For example, a pathologist might fill in a template in SoftPath:

Diagnosis      | Histology           | Tumor size (cm) | T stage | N stage | Margin status
Adenocarcinoma | Lepidic predominant | 1.4             | 1       | 0       | Negative
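As a rough illustration of the gap between free text and discrete data, the example sentence above can be converted into a few of the template's fields with simple pattern matching. This is only a minimal sketch: the function name `extract_fields`, the field names and the regular expressions are assumptions made for illustration and are tuned to this one sentence, not a general pathology report parser.

```python
import re

# Free text sentence taken from the example above.
REPORT = ("The excision shows lepidic predominant adenocarcinoma of the lung, "
          "1.4 cm, T1 N0, with negative resection margins")


def extract_fields(text):
    """Pull a few discrete fields out of one free text sentence.

    The patterns are illustrative assumptions; real reports vary far more
    in wording than these expressions can handle.
    """
    fields = {}

    # Tumor size: a number (optionally with a decimal part) followed by "cm".
    size = re.search(r"(\d+(?:\.\d+)?)\s*cm", text)
    if size:
        fields["tumor_size_cm"] = float(size.group(1))

    # T and N stage written as, for example, "T1 N0".
    stage = re.search(r"\bT(\d[a-c]?)\s*N(\d)\b", text)
    if stage:
        fields["t_stage"] = stage.group(1)
        fields["n_stage"] = stage.group(2)

    # Margin status: "negative" or "positive" directly before "margin(s)".
    margin = re.search(r"\b(negative|positive)\s+(?:resection\s+)?margins?\b",
                       text, re.IGNORECASE)
    if margin:
        fields["margin_status"] = margin.group(1).capitalize()

    return fields


if __name__ == "__main__":
    print(extract_fields(REPORT))
```

With discrete data, none of this parsing is needed because each value already sits in its own labeled field.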
Background / timeline
  • 1980s: corpus linguistics and a rise in processing power allow significant work in machine learning algorithms (e.g., logistic regression)
  • 1990s: era of statistical natural language processing as well as support vector machines, etc.
  • 2000s: ascendance of unsupervised and semisupervised learning algorithms
  • 2010s: era of neural NLP as the use of deep neural networks and representation learning became predominant
  • Reference: Cancer 2017;123:114
    • In a review of the pathology NLP literature up to 2014, the most used technique was word / phrase matching, followed by machine learning
      • The Unified Medical Language System and SNOMED were most frequently employed to encode information obtained from reports; extracting structural information, automated coding and case detection were the most common use cases (J Clin Pathol 2016 Jul 22 [Epub ahead of print])
Key concepts / terminology
  • Tokenization: automatic division of text into fundamental units or tokens (similar to words)
  • Stemming and lemmatization: reduction of words into their basic forms or roots (e.g., metastatic to metasta* or invasive to inva*)
  • Part of speech (POS) recognition: identification of nouns, verbs, adjectives, adverbs, prepositions, conjunctions, articles, subjects, objects, etc.
  • Bag of words: a method for representing a piece of text by considering only the set of words it contains, discarding all information about order and phraseology
  • Unified Medical Language System (UMLS): a set of files and software that codifies hundreds of thousands of medical terms across several health / biomedical vocabularies, allowing more effective interoperability between biomedical information systems and services
  • SNOMED: a systematically organized, computer processable collection of medical terms providing codes, synonyms and definitions used in clinical documentation and reporting
  • Automatic text summarization: the use of computer programs to reduce text written in natural languages such as English, whether by selection of important sentences and phrases (extractive) or true summarization by rewriting (abstractive)
  • Embeddings: representations of data as points in a vector space; these can be produced, for example, by a pretrained language model such as bidirectional encoder representations from transformers (BERT)
  • Support vector machine (SVM): a machine learning model for classification in which a data point is viewed as a p dimensional vector and (p-1) dimensional hyperplanes are drawn to separate different categories of points
  • Neural network: a ubiquitous machine learning construct, capable of complex and nonlinear behavior, in which the nodes (arranged in layers) have inputs and outputs
  • Convolutional neural network (CNN): a neural network structured for image analysis / computer vision which involves convolution operations (i.e., the application of a filter or kernel across the pixels of an image)
  • Recurrent neural network: a fully connected neural network architecture in which some layers feed backward to earlier layers, forming a loop that gives the network a memory
    • Used in sequence labeling tasks as well as text classification and generation
    • Long short term memory builds off this design, giving the network a more robust memory
  • Long short term memory (LSTM): a neural network architecture useful for text and speech which implements a feedback loop to process not only single data points but entire sequences of data
  • Bidirectional encoder representations from transformers (BERT): an NLP solution, at the crux of which is the transformer, a deep learning model that adopts the mechanism of self attention (differentially weighting the importance of parts of a sequence of input)
    • BioBERT is a BERT based language model pretrained on biomedical texts
  • Latent Dirichlet allocation (LDA): a topic model in which a set of topics are attributed in different degrees to a set of documents according to a 3 level hierarchical Bayesian model
  • Transformer: a neural network model made up of 2 modules: an encoder and a decoder
    • The encoder uses a mechanism called attention to identify relationships between relevant words in the text
    • By masking, the decoder is only able to see the words up to the current position in text and has the challenge of predicting the next word
  • Generative pretrained transformer (GPT): a language model based on the transformer unit, which uses only the decoder half
    • Has gone through versions 2, 3 and 3.5 (being trained on progressively larger datasets), with a fourth likely in late 2023
  • Reference: Am J Pathol 2022 Aug 17 [Epub ahead of print]
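The first few concepts in this list (tokenization, stemming and bag of words) can be sketched in a few lines of Python. This is a toy example under stated assumptions: the function names, the tokenization rule and the short suffix list are invented for illustration, and production stemmers (such as the Porter algorithm) use far more elaborate rules.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase tokens; decimal numbers stay in one piece."""
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower())


# Toy suffix list, an assumption chosen to reproduce the examples above
# (metastatic to metasta*, invasive to inva*).
SUFFIXES = ("tic", "sive", "ing", "es", "s")


def stem(token):
    """Strip the first matching suffix, keeping at least 4 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def bag_of_words(text):
    """Count stemmed tokens; all information about word order is discarded."""
    return Counter(stem(token) for token in tokenize(text))


if __name__ == "__main__":
    print(tokenize("Invasive carcinoma, 1.4 cm"))
    print(stem("metastatic"), stem("invasive"))
    # Two phrasings with the same words in a different order
    # produce the same bag of words.
    print(bag_of_words("Invasive carcinoma, metastatic")
          == bag_of_words("metastatic, invasive carcinoma"))
```

Because the bag of words keeps only counts, the two differently ordered phrases in the usage example become indistinguishable, which is the limitation that sequence aware models such as recurrent neural networks and transformers are meant to address.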
Diagrams / tables

Images hosted on other servers:

Architecture of deep neural network

Bag of words model

Word embedding

Uses by pathologists
  • A significant amount of work has been done using the tools of NLP to examine the copious data in electronic health records (EHRs) and lab information systems (LISs)
    • Extraction of information (exam indication, how proximal the endoscope reached, whether a polyp was found etc.) from colonoscopy records of 4 U.S. institutions (J Am Med Inform Assoc 2017;24:986)
    • Identification of renal pathology cases requiring urgent clinician attention (e.g., infections in immunocompromised patients, transplant rejection, unexpected malignancies, glomerular crescents) (Am J Surg Pathol 2012;36:376)
    • Prediction of CPT codes from the text of pathology reports (J Pathol Inform 2019;10:13)
    • Extraction of information on bladder carcinomas (invasion, grade, presence of muscularis propria and CIS) from pathology reports (Urology 2017;110:84)
    • Automated detection of reportable cancer diagnoses from the text of pathology reports, leading to a reduction in workload for human tumor registrars (J Am Med Inform Assoc 2016;23:1077)
    • Automated classification of tumor morphology from the text of pathology reports from an Italian oncological center (J Biomed Inform 2021;116:103712)
    • Automatic extraction of mammographic and pathologic findings from free text mammogram and pathology reports for clinical decision support (Cancer 2017;123:114)
    • Automated coding of free text breast pathology reports with limited manual assistance (J Pathol Inform 2015;6:38)
    • Extraction of key information (specimen, procedure, pathologic entity) from a very large set of pathology reports using bidirectional encoders and LSTM (Sci Rep 2020;10:20265)
    • Automated retrieval and classification of colorectal pathology reports, distinguishing adenocarcinomas from others (J Pathol Inform 2022;13:100008)
    • Biomedical knowledge extraction from MEDLINE abstracts (Multimed Tools Appl 2023;82:3713)
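Several of the uses above, such as flagging reports that may need urgent attention, can be approximated in their simplest form by word / phrase matching. The sketch below is hypothetical: the phrase list is loosely based on the examples in this section, and it does not reproduce the term lists or methods of any of the cited studies.

```python
import re

# Hypothetical trigger phrases (assumptions for illustration only).
URGENT_PHRASES = (
    "rejection",
    "crescent",
    "malignan",  # matches both "malignancy" and "malignant"
    "fungal",
)


def find_urgent_phrases(report_text):
    """Return the trigger phrases found in the report, ignoring case."""
    # Lowercase the text and collapse runs of whitespace before matching.
    text = re.sub(r"\s+", " ", report_text.lower())
    return [phrase for phrase in URGENT_PHRASES if phrase in text]


if __name__ == "__main__":
    print(find_urgent_phrases("Cellular CRESCENTS in 3 of 10 glomeruli."))
    print(find_urgent_phrases("Mild interstitial fibrosis."))
```

A matcher this simple does not recognize negation (for example, "no evidence of rejection" would still be flagged), so its hits would need human review or a more capable model.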
Board review style question #1
Which of the following is the least realistic use case of natural language processing algorithms?

  A. Automated categorization or classification of disease entities (e.g., based on pathologist descriptions of histology)
  B. Extracting pertinent information from patient electronic health records (EHRs)
  C. Identifying potentially urgent situations (needing immediate clinician attention) from pathology report text
  D. Researching and writing articles for pathology scientific publications
  E. Template and form based pathology report generation
Board review style answer #1
D. Researching and writing articles for pathology scientific publications. All of the other uses for NLP are alluded to in the article. While natural language models such as ChatGPT and its many competitors are able to write fluent prose, they are still quite far from being able to produce journal articles.

Board review style question #2
What is natural language processing?

  A. An automated script that can search through a pathology information system and recognize key words suggesting a cancer diagnosis
  B. Creating word clouds from documents or collections of documents
  C. Programming of chatbots and personal assistants
  D. The branch of AI that deals with enabling computers to understand and generate human language
  E. Training a computer application to generate text mimicking a human
Board review style answer #2
D. The branch of AI that deals with enabling computers to understand and generate human language. This is the definition of natural language processing. All the other choices are examples of NLP applications.

Board review style question #3
Which of these describes discrete (as opposed to free) text data?

  A. Extracting all mentions of a certain disease entity from a patient’s documents in the EMR system
  B. Filling out a job application form with fields for first and last names, street address, city, state and postal code
  C. Scanning through a novel for sentences that mention a particular character
  D. Searching the Comments sections of anatomical pathology reports
Board review style answer #3
B. Filling out a job application form with fields for first and last names, street address, city, state and postal code. Only this answer describes information that goes into predefined form fields, obviating the need to parse the grammar of free text. While the template sections of pathology reports in an LIS may incorporate form fields, the comment and diagnosis sections are usually composed of free text.
