Molecular markers


Editor-in-Chief: Debra L. Zynger, M.D.
Emily M. Hartsough, B.S.
Pawel Mroz, M.D., Ph.D.

Topic Completed: 25 July 2019

Minor changes: 27 May 2021

Copyright: 2019-2021,, Inc.

PubMed Search: Next generation sequencing [title] review

Cite this page: Hartsough EM, Mroz P. NGS-general. website. Accessed November 29th, 2021.
Definition / general
  • A high throughput technique that allows for the sequencing of millions of nucleic acids simultaneously in parallel
Essential features
  • Allows for high throughput and cost effective sequencing that facilitates clinical use
  • There are different methods of next generation sequencing, primarily divided into short read (pyrosequencing, sequencing by synthesis, sequencing by ligation, ion torrent) and long read (single molecule real time sequencing, nanopore sequencing)
  • Clinical applications primarily include oncology panels and germline panels as well as emerging whole exome sequencing / whole genome sequencing
  • Next generation sequencing (NGS), massively parallel sequencing, whole genome sequencing (WGS), whole exome sequencing (WES), RNA sequencing (RNA-Seq), methylation sequencing (Methyl-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq)
CPT codes used for NGS testing
  • 81445 - targeted genomic sequence analysis panel
  • 81450 - targeted genomic sequence analysis panel
  • 81455 - targeted genomic sequence analysis panel
  • 81479 - unlisted molecular pathology procedure
Diagrams / tables

Images hosted on other servers (images not available):
  • Overview of major methods
  • Ion semiconductor sequencing
  • Bridging PCR
  • Emulsion PCR
  • Differences between sequencing
  • Sequencing pipelines and timelines
  • Third generation sequencing

  • Sanger sequencing:
    • Gold standard
    • Method: utilizes chain termination with modified di-deoxynucleotides to determine sequence
    • Advantages: can sequence any target (DNA, RNA, epigenetic changes), high quality, long read lengths
    • Disadvantages: high cost, low throughput, time consuming and insufficient sensitivity to identify somatic variants in tumor samples
    • Use: primarily research applications, limited / declining clinical applications, validation tool for NGS data
Advantages of NGS (compared with Sanger sequencing)
  • High throughput output
  • Improved resolution
  • Cost effective
Disadvantages of NGS (compared with Sanger sequencing)
  • Shorter read lengths
  • Decreased raw accuracy in areas of homology, repeat expansions, large indels, copy number variants (CNVs) or other structural variants
Overview of NGS methodology
  • Library (sample) preparation: random fragmentation of DNA followed by ligation of common adaptor sequences
    • Sample input: DNA molecules (from blood, bone marrow, buccal swab, saliva, formalin fixed paraffin embedded tissue, etc.)
    • Fragmentation: enzymatically (restriction enzyme, transposase), sonically or mechanically fragment the DNA into random sizes (typically 200 - 300 nucleotides for short read sequencing)
    • Modification of target DNA:
      • Adaptors: unique DNA sequences are added to the ends of the fragmented DNA samples that serve multiple functions:
        • Provide a bar code (index) containing patient identifiers for multiplexing different samples on the same run
        • Allow for hybridization of sequences to sequencing chips / beads
        • Serve as universal priming site for amplification and sequencing primers
    • Enrichment: this step is only needed if analyzing specific genomic regions (disease gene panels, exomes, etc.)
      • PCR amplification (amplicon):
        • Advantages: ideal for smaller genomic regions
        • Disadvantages: may miss target of interest (lower sensitivity)
      • Sequence capture / hybridization:
        • Baits (biotinylated RNA or DNA oligonucleotides) bind specific regions of DNA
        • Advantages: ideal for larger genomic regions
        • Disadvantages: lower enrichment for target regions due to off target capture (lower specificity)
  • Amplification and cluster generation:
    • Each molecule is immobilized and amplified on a surface (chip, flow cell, beads, nanoball, etc.) utilizing multiplex PCR allowing for the PCR amplicons derived from a single template fragment to be clustered in close proximity to the original molecule (Nat Rev Genet 2016;17:333)
      • Bead based systems: one adaptor complementary to a specific oligonucleotide fragment is fixed to a bead and amplification occurs via emulsion PCR, wherein millions of clonal DNA fragments are immobilized on a single bead
      • Solid surface systems: adaptor fixes the oligonucleotide to the chip in a specific location
        • Amplification occurs via bridge PCR directly on the slide, wherein forward and reverse primers are bound to the surface, which provide binding sites for complementary single stranded DNA fragments
        • Flow cells: patterned solid surface system that defines precise location of primers on the slide, allowing for higher throughput
      • DNA nanoballs: in solution, DNA is iteratively ligated, circularized and cleaved with four unique adaptor regions
        • Rolling circle amplification is then used to generate DNA nanoballs, which are distributed on a patterned slide
  • Sequencing:
    • Each cluster will act as an individual sequencing reaction and libraries from multiple samples can be sequenced simultaneously in parallel
    • Reads: output of sequencing, containing a series of nucleotides that is the sequence that represents the original template molecule
    • Paired end reads: the DNA template is sequenced from both ends, producing forward and reverse reads that improve mapping, coverage and throughput (Nat Rev Genet 2016;17:333)
      • Can aid in identification of structural rearrangements if there is asymmetry between forward and reverse reads
    • Categorization of NGS platforms:
      • Generations: (Nat Rev Genet 2016;17:333)
        • Second generation: reliant on PCR
          • Pyrosequencing (Roche 454 system)
          • Sequencing by ligation (AB SOLiD system, Complete Genomics, Polonator G.007, etc.)
          • Sequencing by synthesis (Illumina, Qiagen GeneReader)
          • Ion semiconductor sequencing (Ion Torrent by Thermo Fisher)
        • Third generation: real time sequencing without need for PCR
          • Single molecule real time (SMRT) (PacBio, Roche)
          • Nanopore (Oxford Nanopore Technologies)
      • Short read and long read: (Int J Mol Sci 2017;18:E308)
        • Short read: reads < 300 base pairs
          • Examples: sequencing by synthesis, sequencing by ligation, pyrosequencing
          • Advantages: low cost per Gb, high accuracy
          • Disadvantages: decreased alignment, limited detection of structural rearrangements
        • Long read: reads > 2.5 Kb
          • Examples: single molecule real time (SMRT), nanopore
          • Advantages: improved alignment, detection of structural variations and large rearrangements, sequencing repetitive regions, discovery of novel RNA transcript isoforms, no reliance on PCR improves portability and turnaround time
          • Disadvantages: high cost per Gb, low accuracy
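As a rough illustration of the paired end reads described above, the sketch below takes a single library fragment and derives a forward read from one end and a reverse read from the other end. The function names, the 8 base read length and the toy fragment are assumptions made for illustration; they do not come from any sequencing vendor's software.

```python
# Illustrative sketch only: names and the toy fragment are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Read `read_length` bases inward from each end of one fragment."""
    # Forward read: the first bases of the fragment as written.
    forward = fragment[:read_length]
    # Reverse read: the first bases of the opposite strand, which is
    # the reverse complement of the far end of the fragment.
    reverse = reverse_complement(fragment)[:read_length]
    return forward, reverse


fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACA"
forward, reverse = paired_end_reads(fragment, 8)
print("forward read:", forward)
print("reverse read:", reverse)
```

Because the two reads come from opposite ends of the same fragment, an aligner can use their expected orientation and spacing as an additional constraint; a pair that maps with unexpected orientation or distance is one hint of a structural rearrangement.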
Data output and analysis
  • Demultiplexing: pooled patient samples (sample libraries) that were sequenced simultaneously are then separated by barcodes specific to each patient sample (indices)
  • Reads are physically clustered together based on sequence similarity, and forward and reverse reads are aligned (paired)
  • Alignment:
    • Resequencing: sequence reads aligned to reference sequence (the reference genome of one individual)
    • De novo: sequence reads aligned to each other
  • Interpretation: pathogenic, likely pathogenic, variant of uncertain significance, likely benign, benign (J Mol Diagn 2017;19:4, Genet Med 2015;17:405)
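The demultiplexing step can be sketched in a few lines. This is a simplified illustration, not a production tool: it assumes the index sits at the start of each read and must match exactly, whereas real instruments typically read the index separately and tolerate mismatches. The barcode table, sample names and reads are hypothetical.

```python
from collections import defaultdict

# Hypothetical index-to-sample table; a real run would take this
# from a sample sheet.
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}


def demultiplex(reads, barcodes, index_length=4):
    """Group pooled reads by their leading index and trim the index off."""
    bins = defaultdict(list)
    for read in reads:
        index, insert = read[:index_length], read[index_length:]
        # Reads whose index matches no sample are kept separately
        # rather than being assigned to a patient.
        sample = barcodes.get(index, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


pooled = ["ACGTGGATTC", "TGCATTTGAC", "ACGTCCCTAG", "GGGGAAAACC"]
result = demultiplex(pooled, BARCODES)
for sample, inserts in result.items():
    print(sample, inserts)
```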
Platforms for NGS
  • Pyrosequencing
    • First commercially successful NGS system
    • Sequencing mechanism:
      • Beads bound with template are distributed on a plate with beads containing enzymes
      • One of each of the four nucleotides is added iteratively to the sequencing reaction
      • Pyrophosphate (PPi) is released during nucleotide incorporation and the amount of PPi released is proportional to the number of nucleotides incorporated
    • Example: 454 system (Roche)
    • Advantages: fast, long read length
    • Disadvantages: high cost of reagents, relatively high error rate in homopolymers longer than 6 bases, low throughput (J Biomed Biotechnol 2012;2012:251364)
  • Sequencing by ligation
    • Sequencing mechanism: hybridization and ligation of labelled probe to DNA strand
      • Utilizes octamer oligonucleotide probes containing two probe specific bases and six degenerate bases
      • Each probe will have one of four fluorescent dyes linked to the last base, a ligation site on the first base and a cleavage site on the fifth base
      • Sequencing occurs by complementary binding between the probe and template; an anchor binds complementarily to the adaptor and serves as an initiation site for ligation
      • Released fluorescence is then imaged
      • Then fluorescent signal and last four nucleotides of the octamer are cleaved and the cycle continues
      • After several cycles of hybridization, ligation and cleavage, the DNA strand is denatured and another sequencing primer offset by one base is used to repeat the reaction
      • 5 total sequencing primers are used and the sequence of the fragment can be deduced after approximately 5 rounds of sequencing
    • Example: AB SOLiD system (Thermo Fisher)
    • Advantages: highest accuracy of second generation NGS, no reliance on a polymerase
    • Disadvantages: short sequencing reads (Nat Rev Genet 2016;17:333, J Biomed Biotechnol 2012;2012:251364)
  • Sequencing by synthesis (SBS)
    • Sequencing mechanism:
      • Each cycle, all four dNTPs, each carrying a different cleavable fluorescent dye and a removable blocking group, are added to the flow cell
      • Incorporation of a fluorescently labelled dNTP by the DNA polymerase terminates polymerization for that cycle and the fluorescent signal identifies the specific nucleotide incorporated into the growing DNA strand
      • Fluorescent dye and blocking group are enzymatically cleaved so a new nucleotide can be added during the next cycle
    • Example: Illumina sequencing (MiSeq, HiSeq, NextSeq and NovaSeq platforms)
    • Advantages: high accuracy (sequences base by base), largest output, cheapest, less susceptible to homopolymer errors than single nucleotide addition platforms (Ion Torrent, pyrosequencing)
    • Disadvantages: cyclical nature leads to long sequencing times, short read assembly (Duncan: Diagnostic Molecular Pathology, Chapter 3 Next-Generation Sequencing in the Clinical Laboratory, 2017, Nat Rev Genet 2016;17:333, J Biomed Biotechnol 2012;2012:251364)
  • Ion semiconductor sequencing
    • Sequencing mechanism: Ion Torrent sequencing (semiconductor sequencing)
      • DNA fragments are enriched on beads, which are then arrayed onto a microwell plate so that only one bead occupies one well
      • Unmodified single nucleotides are successively added to the microwell chip (sequencing by single nucleotide addition)
      • Incorporation of a nucleotide by DNA polymerase causes release of a proton, which is detected by an ion sensor that is sensitive to changes in pH each time a complementary nucleotide is added
      • Sequence is determined by evaluating the electrical signal intensity during each sequential nucleotide exposure
    • Example: Ion Torrent (Thermo Fisher)
    • Advantages: fast sequencing (no reliance on camera or fluorescence), low cost, smaller instrument size
    • Useful for point of care testing and gene panels
    • Disadvantages: less accurate with increased homopolymer error (Nat Rev Genet 2016;17:333)
  • Single molecule real time (SMRT) sequencing
    • Compared to other systems, there is no reliance on clonal amplification of target region or chemical cycling for each dNTP added
    • Sequencing mechanism:
      • Specialized flow cell with thousands of microwells, each containing a single immobilized polymerase and a copy of the target molecule, often a circular template that can be sequenced multiple times
      • Microwells are flooded with an excess of fluorescently labelled nucleotides
      • Dye is cleaved during incorporation by the DNA polymerase, allowing it to diffuse away from the sensor and allowing the active site of the DNA polymerase to become available for the next dNTP
      • A light detector within the microwell detects the signal in real time as the nucleotides are incorporated
    • Example: Pacific Biosciences
    • Advantages: fastest method, can sequence longer targets, decreases bias and error attributed to PCR, ideal for de novo genome assembly applications, real time data generation
    • Disadvantages: lower accuracy and throughput (Nat Rev Genet 2016;17:333, J Biomed Biotechnol 2012;2012:251364)
  • Nanopore
    • Compared to other systems, this system does not monitor incorporation of nucleotides or use a secondary signal (fluorescence, pH, etc.)
    • Sequencing mechanism:
      • Ionic current is passed through nanopore proteins
      • Target DNA is then passed through a nanopore and each base alters the current
      • Sequencing occurs in real time as the DNA molecule passes through the nanopore
      • Hairpin library structure allows forward and reverse strands to be sequenced (e.g. Oxford Nanopore Technologies)
    • Advantages: can sequence very long molecules (>10 Kb), lower cost, small and portable, real time data generation
    • Disadvantages: relatively high error rate (Nat Rev Genet 2016;17:333, J Biomed Biotechnol 2012;2012:251364)
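The homopolymer errors noted for the single nucleotide addition platforms (pyrosequencing, Ion Torrent) follow from how the signal is read: one nucleotide is flowed at a time and the signal intensity reflects how many bases were incorporated in that flow. The sketch below is a toy model of that idea, assuming a cyclic flow order of T, A, C, G and an ideal signal of 1.0 per incorporated base; the function names and the noisy values are hypothetical.

```python
FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order


def simulate_flowgram(synthesized: str, flow_order: str = FLOW_ORDER) -> list[int]:
    """Ideal (noise-free) signal per flow for the strand being synthesized."""
    if any(base not in flow_order for base in synthesized):
        raise ValueError("sequence contains a base that is never flowed")
    signals, pos, flow = [], 0, 0
    while pos < len(synthesized):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Every consecutive matching base is incorporated in the same flow.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
        flow += 1
    return signals


def decode_flowgram(signals, flow_order: str = FLOW_ORDER) -> str:
    """Call bases by rounding each flow's signal to a whole number of bases."""
    return "".join(
        flow_order[i % len(flow_order)] * round(signal)
        for i, signal in enumerate(signals)
    )


truth = "TTACGGGGGGA"
ideal = simulate_flowgram(truth)
# Hypothetical noisy measurement: the long G run reads slightly low.
noisy = [2.1, 0.9, 1.0, 5.4, 0.1, 1.1]
print(ideal, decode_flowgram(ideal))
print(noisy, decode_flowgram(noisy))
```

In this toy model a small relative error on the six base G run is enough to drop one base, while the same relative error on a single base flow would still round to the correct call.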
Computer science
  • Overview: raw sequencing data is translated into individual sequences (reads), each read is mapped to its targeted region in the genome and this alignment identifies differences between the sample DNA and the standardized reference
  • Base calling and demultiplexing: raw format data (fluorescence intensity, electrical impulse, etc.) is converted to a nucleotide sequence assigned to each position in the target DNA sequence (e.g. binary base call (BCL) format)
    • Intensity of the signal relative to the background noise allows the base calling algorithm to generate a confidence score for each nucleotide
    • Read data is demultiplexed by converting from BCL to FASTQ format, which contains a read identifier, nucleotide sequence and confidence scores
      • FASTQ format:
        • Line 1: @Sequence identifier (and optional description)
        • Line 2: Raw sequence read: “ACTGACTG”
        • Line 3: Spacer: ‘+’
        • Line 4: Phred quality score for each base of line 2 (encodes the probability that the base was incorrectly called by the sequencer): ‘!AA>**!C’
          • Q = -10 × log10(P)
            • Q = Phred quality score
            • P = Probability of an incorrect base call (calculated by peak shape and overlap at bases)
            • Example: Q = 30 corresponds to P = 0.001 (1 error in 1,000 base calls)
    • Example software: CASAVA (Illumina)
  • Sequence alignment:
    • Mapping all individual reads to locations on the genome
    • Requires read data and a reference sequence to serve as a standard against which individual reads are compared and aligned
      • Biologic factors (benign variants and mutations) and technical factors (errors in sequencing or inaccurate base calling) contribute to imperfect alignment
      • Penalties are generated for inexact alignment; above a certain penalty threshold, a read is considered unable to be aligned
    • Coverage (depth): number of times a nucleotide is sequenced, or the number of reads that align over a single base
      • Increased coverage verifies the base call at that position and increases confidence that the base call accurately represents the original target sequence
      • 20-fold coverage (20 reads over target region) is required to be confident of the base called in a pure or germline specimen
      • 500- to 1,000-fold coverage is needed to identify variants in mixed samples like tumor specimens
    • Alignment formats: Sequence Alignment/Map (SAM) format and its binary counterpart (BAM) store alignment data indexed by reference sequence location
    • Example software: BWA, Bowtie 2, MAQ, Stampy, Novoalign
  • Variant calling:
    • After alignment to a reference genome, single nucleotide polymorphisms (SNPs), single nucleotide variants (SNVs) and insertion/deletions (indels) can be identified
    • Considers depth of coverage, variant frequency and alignment score
    • Example software: SAMtools mpileup, GATK
  • Visualization of data:
    • Example software: Integrative Genomics Viewer (IGV), UCSC Genome Browser
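To make the FASTQ layout and Phred scores concrete, the sketch below parses the four line example record from this section and converts its quality string to scores and error probabilities. It assumes the standard Phred definition, Q = -10 × log10(P), and the common Phred+33 ASCII encoding (quality character code minus 33); the function names and the read identifier are hypothetical.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to get the probability of a wrong base call."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines):
    """Parse one four line FASTQ record into a dictionary."""
    header, sequence, spacer, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not spacer.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return {
        "id": header[1:],
        "sequence": sequence,
        "quality": phred_scores(quality),
    }


record = parse_fastq_record(["@read_1", "ACTGACTG", "+", "!AA>**!C"])
for base, q in zip(record["sequence"], record["quality"]):
    print(base, q, f"{error_probability(q):.4f}")
```

Under this encoding, the `!` characters in the example decode to Q = 0 (no confidence in the call), while `A` and `C` decode to scores above 30.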
Types of NGS
  • DNA sequencing:
    • Data analysis: point mutations, insertions/deletions (indels), copy number variants (CNVs), structural variants
    • Whole genome sequencing:
      • Sequences both coding and non coding regions of the genome
      • Advantages: library preparation does not require enrichment or amplification (high specificity)
      • Disadvantages: high cost, less depth of coverage, complex data analysis and interpretation
    • Targeted sequencing:
      • Utilizes enrichment with specific primers to sequence certain areas of the genome
      • Advantages: allows more individual samples to be run with each sequencing reaction, improves depth of coverage of regions of interest and reduces cost
    • Whole exome sequencing:
      • Sequencing of only protein coding area of genome (~1 - 2%)
      • Need > 20X coverage per nucleotide for sufficient specificity and sensitivity for mutation detection
      • Advantages: more affordable for clinical use
    • Gene panels:
      • For known hereditary syndromes (e.g. Lynch syndrome, hereditary breast / ovarian cancer syndrome) or mitochondrial diseases
      • Depth of 80X sufficient to detect germline variants, > 500X coverage needed for somatic mutations
  • RNA sequencing (RNA-seq):
    • Utilizes reverse transcriptase PCR (RT-PCR)
    • Data analysis: differential expression, gene fusions, alternative splicing and RNA editing
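The different depth requirements quoted above for germline and somatic testing can be motivated with a simple binomial model. The sketch below is a back of the envelope calculation, not a validated assay metric: it assumes reads are sampled independently, ignores sequencing error, and uses a hypothetical calling rule of at least 5 variant supporting reads and a hypothetical 5% variant allele fraction for a tumor subclone.

```python
from math import comb


def prob_at_least_k_variant_reads(depth: int, vaf: float, k: int) -> float:
    """P(X >= k) where X ~ Binomial(depth, vaf) counts variant reads."""
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k)
    )
    return 1.0 - p_fewer


MIN_READS = 5  # hypothetical minimum number of supporting reads

# Heterozygous germline variant: about half of the reads carry it.
print("germline, 20X:", round(prob_at_least_k_variant_reads(20, 0.5, MIN_READS), 3))

# Low fraction somatic variant (assumed 5% of reads) at increasing depth.
for depth in (20, 80, 500):
    expected = depth * 0.05
    p = prob_at_least_k_variant_reads(depth, 0.05, MIN_READS)
    print(f"somatic, {depth}X: expected variant reads = {expected:.1f}, "
          f"P(>= {MIN_READS} reads) = {p:.3f}")
```

Under these assumptions a heterozygous germline variant is almost always supported by enough reads at modest depth, whereas a low fraction somatic variant only becomes reliably detectable at depths in the hundreds.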
Clinical applications
  • Oncology testing:
  • Germline testing:
    • Often utilizes WES or WGS to identify germline DNA variants in order to:
      • Diagnose heritable disease
      • Explain phenotypes that are likely genetic in origin
    • Commercial testing allows for SNP detection and analysis (e.g. 23andMe, Ancestry, GeneDx, Ambry Genetics, etc.)
  • Microbiology:
  • Pharmacogenomics:

Introduction to different NGS platforms

Sequencing by synthesis (Illumina)

Board review style question #1

    What manner of next generation sequencing does the following image depict?

  A. Ion Torrent sequencing
  B. Pyrosequencing
  C. Sanger sequencing
  D. Sequencing by ligation
  E. Sequencing by synthesis
Board review style answer #1
E. Sequencing by synthesis


Reference: NGS-general
Board review style question #2
    Which of the following is true regarding long-read sequencing compared with short-read sequencing?

  A. It can more readily detect large rearrangements than short-read sequencing
  B. It is cheaper than short-read sequencing
  C. It is more accurate than short-read sequencing
  D. It utilizes PCR based amplification
Board review style answer #2
A. It can more readily detect large rearrangements than short-read sequencing


Reference: NGS-general