Molecular markers

DNA sequencing

Last author update: 28 February 2020
Last staff update: 26 May 2021

Copyright: 2020,, Inc.

PubMed Search: DNA sequencing[TI] molecular pathology

Rodney E. Shackelford, D.O., Ph.D.
Cite this page: Shackelford RE. DNA sequencing. website. Accessed March 22nd, 2023.
Diagrams / tables

Images hosted on other servers:

Nucleotide bases and gel for sequencing

Initial sequencing
  • The first nucleic acid sequencing began in the mid 1960s using 2 dimensional chromatography
  • The initial protocols were innovative but inefficient by today's standards
  • For example, in 1973 Gilbert and Maxam published the sequence of the 24 bp lac operator using a protocol that required 300 - 700 grams of bacteria, multiple purification steps, conversion of the selected DNA fragments into RNA and digestion of these sequences with different RNases (Proc Natl Acad Sci USA 1973;70:3581)

Several properties of DNA made the initial attempts at DNA sequencing difficult:
  1. The chemical properties of different DNA molecules were so similar that separating them appeared difficult
  2. DNA molecules were much longer than the amino acid chains of proteins
  3. No base specific DNases were known and previous protein sequencing methods had depended upon proteases that cut adjacent to specific amino acids

  • Since structurally simpler RNA molecules did not have these drawbacks, they were among the first larger nucleic acids sequenced; the first relatively large nucleic acid sequenced was the Escherichia coli alanine tRNA in 1965 (Science 1965;147:1462, Nucleic Acids Res 2007;35:6227)

Later efforts
  • In the 1970s, Sanger and Maxam-Gilbert developed chain termination and base specific chemical cleaving methodologies that overcame many of the initial problems, vastly improving sequencing efficiency, especially when applied to DNA
  • Since then, improved sequencing methods, combined with automated analysis and developments in bioinformatics, have led to a doubling of the known nucleic acid sequences every 16 months for the past 40 years, an increase in database size of about nine orders of magnitude since 1965
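The growth figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration (the function names are ours, not from any cited source): a doubling every 16 months over 40 years gives 30 doublings, and 2 to the 30th power is roughly one billion, i.e. about nine orders of magnitude.

```python
import math


def doublings(years: float, months_per_doubling: float) -> float:
    """Number of doubling periods that fit into the given number of years."""
    return years * 12 / months_per_doubling


def fold_increase(years: float, months_per_doubling: float) -> float:
    """Total fold increase after repeated doubling."""
    return 2 ** doublings(years, months_per_doubling)


n = doublings(40, 16)            # 30 doublings
growth = fold_increase(40, 16)   # 2**30, roughly 1.07e9
orders = math.log10(growth)      # roughly 9 orders of magnitude
print(n, growth, round(orders, 2))
```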

Recent techniques
  • Many new "next generation" DNA sequencing technologies and platforms are currently in development and the speed, efficiency, cost effectiveness and accuracy of DNA sequencing technology have been steadily improving
  • Since the early 1990s, most DNA sequencing has employed the Sanger dideoxynucleotide chain terminating method, although improvements based on other DNA sequencing technologies are being developed

  • The enormous increase in the ability to sequence nucleic acids has allowed the complete sequencing of the human genome and the genomes of over 180 other species, including:
    • Microorganisms: Bacillus anthracis, Helicobacter pylori, Mycobacterium tuberculosis, Saccharomyces cerevisiae, Yersinia pestis
    • Animal kingdom: chicken (Gallus gallus), chimpanzee (Pan troglodytes), dog (Canis familiaris), fruit fly (Drosophila melanogaster), honey bee (Apis mellifera), Japanese puffer fish (Takifugu rubripes), mouse (Mus musculus), nematode (Caenorhabditis elegans), rat (Rattus norvegicus), sea squirt (Ciona intestinalis)
  • In addition, the woolly mammoth (Mammuthus primigenius) genome and a "first draft" of the Neanderthal (Homo sapiens neanderthalensis) genome have been published, demonstrating that the DNA of long extinct complex organisms can be sequenced from fossil sources
  • DNA sequencing is used in many areas, including forensics, disease diagnosis, personalized medicine (pharmacogenomics), tissue identification, transplantation typing, biotechnology, epidemiology, medical research, comparative genomics and evolution, archeology and anthropology
  • DNA sequencing has also raised many important bioethical issues related to personal privacy, public health and safety
Capillary electrophoresis
  • Capillary electrophoresis (CAE) uses narrow bore capillaries with internal diameters below 100 µm
  • The high surface area to volume ratio of the tubes increases heat dissipation, allowing a higher electrical field to be used for electrophoresis, leading to an 8 to 10 fold faster separation time than slab gel electrophoresis
  • The use of improved matrices, such as non cross linked polymer solutions, has allowed ~1000 bp to be read efficiently and accurately in as little as 2 hours
  • The introduction of capillary electrophoresis was a milestone that allowed the sequencing of the human genome (Nucleic Acids Res 2007;35:6227)
HeliScope Sequencer
  • The HeliScope sequencing platform does not clonally amplify the DNA fragments to be sequenced
  • Instead, DNA libraries are produced by random fragmentation and poly A tails are added to the fragments
  • The fragments are denatured and captured by surface tethered poly T oligomers, yielding a disordered array of primed, single molecule sequencing templates
  • Each sequencing cycle consists of the addition of DNA polymerase and one uniquely fluorescently labeled dNTP, with the reaction moving forward by incorporating one dNTP (or more with homopolymeric nucleotide runs)
  • Following a wash step to remove the unincorporated fluorescent dNTPs, the array is imaged and the positions of the templates that incorporated the fluorescent nucleotide are recorded
  • The fluorescent group is then cleaved and another sequencing round is initiated with a different dNTP, until all four have been placed on the array
  • Multiple four phase sequencing cycles are performed, yielding reads of 25 - 45 bp, with billions of strands measured in a single run
  • Not surprisingly, each run on this platform can require a large amount of computer storage space - about 14 terabytes
  • As with the 454 FLX sequencer, this technique is not efficient at sequencing longer homopolymer runs
  • The resolution of the optical microscopy is a few hundred nm, so the captured molecules must be at least this distance apart for optimal image resolution
  • This platform has been used to sequence the ~7,000 bp M13 viral genome (Science 2008;320:106)
Illumina Genome Analyzer
  • The Illumina Genome Analyzer, similar to the 454 FLX sequencer, begins by constructing a sequence mixed library of DNA fragments of up to several hundred bp
  • Adaptor sequences with forward and reverse PCR primers are attached to each end of these DNA fragments
  • The adaptor linked sequences are then amplified by bridge PCR, where both the forward and reverse PCR primers are tethered to a solid substrate by a flexible linker
  • During PCR, the amplified sequences remain immobilized and clustered in one location on the array; eventually there are 1,000 to 1 million clustered amplicons, sufficient to report incorporated bases at the required signal intensity
  • Several million different clusters can be amplified and distinguished at different locations within eight independent lanes that are on a single flow cell; thus, eight different libraries can be sequenced simultaneously
  • Once the amplification step is completed, the amplicons are made single stranded and a universal sequencing primer is hybridized to a universal sequence flanking the DNA to be sequenced
  • Sequencing occurs with each single base addition, using a modified DNA polymerase and a mix of four modified dNTPs, each with two different modifications:
    1. The dNTPs have a chemically cleavable moiety (a "reversible terminator") at their 3' hydroxyl position that allows only one nucleotide incorporation
    2. Each dNTP has a distinct fluorescent label that can also be chemically cleaved
  • Thus, after one round of dNTP incorporation has occurred, it can be photometrically measured
  • The reversible terminator and fluorescent label are then chemically cleaved, washed away and the DNA sequencing can proceed by adding another labeled dNTP series
  • The read length with this technique is about 36 bp; longer reads are possible but have higher error rates
  • The main limiting factors are incomplete cleavage of the fluorescent labels and of the terminating moieties, both of which result in signal decay and dephasing
  • Homopolymer stretches are far less of a problem with this sequencing platform than with the 454 FLX sequencer
  • Combining different sequences allows this platform to sequence several hundred nucleotides per run
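The cycle described above (one reversibly terminated base per cycle, imaging, then cleavage) can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the instrument's actual software: the dye colors are hypothetical, the template is written 3' to 5' so that the read comes out in synthesis order, and the 1% per cycle failure rate used for dephasing is an arbitrary example value. Because exactly one base is added per cycle, a homopolymer such as TTT is simply read over three cycles.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Hypothetical dye colors; the real dye set is not given in the text.
DYE_FOR_BASE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}


def image_cycles(template_3to5: str, cycles: int) -> list[str]:
    """Return the dye color imaged in each cycle for one cluster.

    The 3' reversible terminator allows exactly one incorporation per cycle,
    so each cycle reports exactly one template position.
    """
    colors = []
    for template_base in template_3to5[:cycles]:
        incorporated = COMPLEMENT[template_base]
        colors.append(DYE_FOR_BASE[incorporated])  # image, then cleave dye and terminator
    return colors


def call_bases(colors: list[str]) -> str:
    """Translate the imaged colors back into the synthesized sequence."""
    return "".join(BASE_FOR_DYE[color] for color in colors)


def in_phase_fraction(cycles: int, failure_rate: float) -> float:
    """Fraction of strands in a cluster still in phase after a number of cycles,
    assuming each strand independently fails a cycle with probability failure_rate."""
    return (1.0 - failure_rate) ** cycles


read = call_bases(image_cycles("TTTACG", cycles=6))
print(read)  # AAATGC
print(round(in_phase_fraction(36, 0.01), 3))
```

With the assumed 1% failure rate, roughly 30% of the strands in a cluster are out of phase by cycle 36, which illustrates why longer reads carry higher error rates.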
Maxam-Gilbert sequencing
  • In 1976 - 77, Allan Maxam and Walter Gilbert developed a gel based DNA sequencing method, also called "chemical sequencing," that used 4 base specific chemical cleavage reactions to cut 32P-end labeled double stranded DNA fragments (Proc Natl Acad Sci USA 1977;74:560)
  • The base specific cuts occur at a small proportion of either:
    1. Both purines (A + G)
    2. Preferentially at A (A > G)
    3. The pyrimidines (C + T)
    4. At cytosines only
  • The resulting DNA fragments are then denatured, run side by side in slab gel electrophoresis, autoradiographed and analyzed
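How a sequence is deduced from the four lanes can be sketched as follows. This is an idealized illustration, not a model of real gel data: it assumes every position yields a band, that band positions map one to one onto base positions, and that the weak G bands in the A > G lane are treated as absent. The lane names and function names are ours.

```python
def cleavage_lanes(seq: str) -> dict[str, set[int]]:
    """Positions (1-based) that give a band in each of the four reaction lanes."""
    lanes = {"A+G": set(), "A>G": set(), "C+T": set(), "C": set()}
    for pos, base in enumerate(seq, start=1):
        if base in "AG":
            lanes["A+G"].add(pos)
        if base == "A":
            lanes["A>G"].add(pos)  # idealized: weak G bands are ignored
        if base in "CT":
            lanes["C+T"].add(pos)
        if base == "C":
            lanes["C"].add(pos)
    return lanes


def read_lanes(lanes: dict[str, set[int]], length: int) -> str:
    """Deduce the sequence by comparing the lanes position by position."""
    bases = []
    for pos in range(1, length + 1):
        if pos in lanes["A>G"]:
            bases.append("A")   # band in both purine lanes
        elif pos in lanes["A+G"]:
            bases.append("G")   # band in A+G only
        elif pos in lanes["C"]:
            bases.append("C")   # band in both pyrimidine lanes
        elif pos in lanes["C+T"]:
            bases.append("T")   # band in C+T only
        else:
            raise ValueError(f"no band at position {pos}")
    return "".join(bases)


lanes = cleavage_lanes("GATTACA")
print(read_lanes(lanes, 7))  # GATTACA
```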

  • Advantages:
    1. Purified DNA can be read directly
    2. Homopolymeric DNA runs are sequenced as efficiently as heterogeneous DNA sequences
    3. Can be used to analyze DNA protein interactions (i.e. footprinting)
    4. Can be used to analyze nucleic acid structure and epigenetic modifications to DNA

  • Disadvantages - this method is not commonly used today because:
    1. It requires extensive use of hazardous chemicals
    2. It has a relatively complex set up / technical complexity
    3. It is difficult to "scale up" and cannot be used to analyze more than 500 base pairs
    4. The read length decreases from incomplete cleavage reactions
    5. It is difficult to make Maxam-Gilbert sequencing based DNA kits

  • Modifications are used to analyze protein DNA interactions (Wikipedia: DNA Footprinting [Accessed 30 May 2018]) and DNA secondary structure, to locate rare bases (such as Hoogsteen base pairs, Wikipedia: Hoogsteen Base Pair [Accessed 30 May 2018]) and base modifications and to resolve ambiguities that arise in dideoxynucleotide sequencing

  • Initially, the useful DNA sequencing read length from a gel was about 100 bp
  • This was significantly increased by using 35S labeled DNA and by using gels with narrower lanes, gel gradient systems, gel to plate binders and temperature control systems to reduce band distortions
  • However, despite all improvements, these limitations remain:
    1. Gel electrophoresis is limited to 700 - 900 bp, with 400 - 500 bp more commonly attained
    2. The first 15 - 40 bp are often difficult to interpret (Nucleic Acids Res 2007;35:6227)
    3. Sequencing techniques based on slab gel electrophoresis require cumbersome gels, buffers, time spent loading and running the gels, autoradiography and analysis; all lower the amount of DNA that can be sequenced
  • To overcome the limitations of slab gel electrophoresis and the manual reading of DNA sequences, other innovations were introduced


Maxam-Gilbert sequencing method

  • Several next generation DNA sequencing techniques employ nanotechnology to increase sequencing speed
  • Pacific Biosciences is working on a DNA sequencing platform to produce 100 Gb of sequence data/hour or enough sequencing capacity and speed to give one fold coverage to the diploid human genome in 4 minutes; it is based on Single Molecule Real Time, where natural DNA synthesis by DNA polymerase is observed and measured by interrogating thousands of nanometer scale aperture chambers in 100 nm metal film (called zero mode waveguides) deposited on a silicon dioxide substrate
  • Each aperture provides a nanophotonic visualization chamber with a detection volume of 20 zeptoliters (1 zeptoliter = 10⁻²¹ liters)
  • This small volume allows the detection of a single molecule in a background of thousands of labeled nucleotides
  • To date, this technology achieves read lengths of 1500 bp, with a read rate of 10 bases/sec
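The throughput claim above can be checked with simple arithmetic. The sketch below reads "Gb" as gigabases and assumes a diploid human genome of about 6 gigabases (roughly twice a 3 gigabase haploid genome); that genome size is our assumption, not a figure from the text.

```python
THROUGHPUT_GBASES_PER_HOUR = 100.0  # target throughput stated in the text
DIPLOID_GENOME_GBASES = 6.0         # assumption: about 2 x 3 gigabases


def minutes_for_coverage(genome_gbases: float, gbases_per_hour: float, fold: float = 1.0) -> float:
    """Minutes needed to generate the requested fold coverage."""
    return fold * genome_gbases / gbases_per_hour * 60


def seconds_per_read(read_length_bp: float, bases_per_second: float) -> float:
    """Time one polymerase needs to produce a single read."""
    return read_length_bp / bases_per_second


print(round(minutes_for_coverage(DIPLOID_GENOME_GBASES, THROUGHPUT_GBASES_PER_HOUR), 1))  # 3.6
print(seconds_per_read(1500, 10))  # 150.0
```

Under these assumptions one fold coverage takes about 3.6 minutes, consistent with the 4 minute figure, and a single 1500 bp read at 10 bases/sec takes 150 seconds.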
Definition / general
  • Pyrosequencing is a real time DNA sequencing method based on the detection of pyrophosphate (PPi) released during the incorporation of a complementary nucleotide in a DNA polymerization reaction

  • The reaction occurs as follows:

    (DNA)n + dNTP --(DNA polymerase)--> (DNA)n+1 + PPi

    PPi + adenosine 5' phosphosulfate --(ATP sulfurylase)--> ATP + SO4(2-)

    ATP + luciferin + O2 --(luciferase)--> AMP + PPi + oxyluciferin + CO2 + one photon

  • The sequencing reaction begins with a sequencing primer binding to a complementary single stranded DNA molecule to be sequenced
  • DNA polymerase is added, and each dNTP in turn is added to and then removed from the reaction
  • When a base is added to the DNA being synthesized, PPi is released and subsequently converted to ATP by ATP sulfurylase in the presence of adenosine 5' phosphosulfate
  • The ATP is used in the conversion of luciferin to oxyluciferin, generating visible light proportional to the ATP concentration
  • Therefore, the amount of light produced in each reaction is proportional to the number of deoxynucleotides incorporated and the DNA sequence is read by comparing the light emitted to the dNTP added to each reaction
  • The light is usually measured by a photomultiplier tube, avalanche photodiode or charge coupled device camera
  • The key to this sequencing technique is removal of previously added dNTPs from the reaction, so that the effect of new dNTPs on light emission can be measured
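The readout logic described above can be sketched as a toy simulation: one dNTP is dispensed at a time, and the light signal is taken to be exactly proportional to the number of bases incorporated. This is an idealized illustration with names of our own choosing; the fixed A, C, G, T dispensation order and the noise free signal are assumptions, and the template is written 3' to 5' so the decoded read appears in synthesis order.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flowgram(template_3to5: str, flow_order: str = "ACGT") -> list[tuple[str, int]]:
    """Simulate cyclic dNTP dispensation.

    Returns (dNTP, signal) pairs, where signal is the number of bases
    incorporated in that flow (0 when the dNTP does not match the template).
    """
    flows = []
    pos = 0
    while pos < len(template_3to5):
        for nucleotide in flow_order:
            incorporated = 0
            # A homopolymer run is extended completely within a single flow.
            while pos < len(template_3to5) and COMPLEMENT[template_3to5[pos]] == nucleotide:
                incorporated += 1
                pos += 1
            flows.append((nucleotide, incorporated))
            if pos == len(template_3to5):
                break
    return flows


def decode(flows: list[tuple[str, int]]) -> str:
    """Rebuild the synthesized sequence from the flow signals."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = flowgram("TTGCAAA")
print(flows)          # [('A', 2), ('C', 1), ('G', 1), ('T', 3)]
print(decode(flows))  # AACGTTT
```

In a real instrument the signal for long homopolymer runs is not cleanly proportional, which is why such runs are harder to call accurately.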

Solid and liquid phase sequencing
  • Removal of previously added dNTPs is accomplished by both solid and liquid phase pyrosequencing
  • In solid phase pyrosequencing, a three enzyme mix (DNA polymerase, ATP sulfurylase, firefly luciferase) is used and the template is bound to a solid phase, such as a magnetic bead
  • After completing each dNTP reaction, the template is washed to remove nonincorporated deoxynucleotides and ATP resulting from the sulfurylase reaction
  • In liquid phase sequencing, a fourth enzyme, such as apyrase, is added to degrade the unreacted nucleotides and ATP
  • One problem with both solid and liquid phase pyrosequencing is the interference of the dATP in luminescence detection, because added dATP would react with luciferin to produce a photon
    • This problem was solved by replacing dATP with dATPαS; this allows efficient dATPαS incorporation by DNA polymerase while reducing photon production, as dATPαS is not a substrate for luciferase
  • The addition of single stranded DNA binding proteins to pyrosequencing reactions has been found to reduce nonspecific primer binding, thus reducing mispriming and increasing signal intensity, allowing reads of higher accuracy
    • Currently, pyrosequencing allows stretches of 300 - 500 nucleotides to be read, shorter than the reads obtained with dideoxynucleotide chain terminating sequencing

Comparison with other methods
  • Advantages: does not require labeled primers or dNTPs or gel electrophoresis; is done in real time and is cost effective when compared to dideoxynucleotide chain terminating sequencing methods
  • Disadvantages: eventual loss of signal in the solid phase protocol due to loss of template and the eventual loss of apyrase activity in liquid phase sequencing due to the accumulation of intermediate products which inhibit this enzyme; the technique is less accurate with homopolymeric runs of over 5 - 6 nucleotides in a row
  • Pyrosequencing is usually used to analyze secondary DNA structures, to detect mutations and to sequence short to medium length DNA segments
  • Parallel pyrosequencing platforms have been developed, such as the 454 Life Science, which can sequence entire bacterial genomes in 10 hours, demonstrating that pyrosequencing has significant research and biotechnology applications
Real time
  • VisiGen Biotechnologies has engineered a DNA polymerase which acts as a real time sensor of modified nucleotides
  • This method uses DNA synthesis, employing a DNA polymerase with a donor fluorescent dye close to the polymerase active site where nucleotides to be incorporated are selected
  • The four dNTPs are also modified, with each having a different acceptor dye compatible with the polymerase donor dye
  • At dNTP incorporation, the donor dye transfers fluorescent resonant energy to the acceptor dye, resulting in measurable light release
  • The four different acceptor dyes give off different wave length emissions, allowing the specific dNTP incorporated to be determined
  • The advantage of this system is that it can sequence DNA at the same rate as the polymerase incorporates dNTPs - several hundred moieties per second, with read lengths of around 1 kb, longer than possible with any current platform
  • This technology may eventually allow incredible sequencing feats, such as sequencing the entire human genome in one day for ~$1,000 (N Biotechnol 2009;25:195)
Roche 454 FLX pyrosequencer
Definition / general
  • The Roche 454 pyrosequencer was introduced in 2004 and for many applications, has replaced Sanger sequencing based instruments due to its high accuracy, low cost and relatively long reads
  • It uses pyrosequencing technology, initially fragmenting a genome by nebulization into 300 - 800 bp fragments which are then "polished" to produce blunt ends; the ends are ligated to two specific adaptor sequences, one with a 5' biotin tag for immobilization of the DNA onto 28 µm streptavidin coated beads
  • The DNA is separated into single strands and bound to beads under conditions which favor the binding of one DNA fragment per bead
  • The beads are compartmentalized in droplets of PCR reaction mixture in an oil emulsion and the PCR reactions occur within each droplet, resulting in up to ten million DNA fragment copies bound to the bead
  • Next, the emulsion is broken, the DNA strands are denatured and each bead carrying many copies of the now single stranded DNA fragment is deposited into an individual well in a fiber optic slide
  • Each well has a diameter of about 44 µm and each slide has about 1.6 million wells
  • To reduce well to well "cross talk," usually not all the wells are filled; this reduces interference of light signals from adjacent wells, which would lower sequencing accuracy and efficiency
  • In general, enough wells are filled to allow about 400,000 parallel reads
  • Smaller beads carrying immobilized enzymes required for pyrosequencing are then deposited in each well
  • The DNA sequence is read as light pulses when the luciferin produced photon emissions are compared to the dNTP used in the reaction
  • The sequencing reaction is less efficient with homopolymer runs of six or more nucleotides
  • The 454 FLX Sequencer combines initial sequencing data, while screening out poor quality sequences, mixed sequences (i.e. more than one initial DNA fragment / bead) and sequences lacking the initial priming sequence used to create the expanded DNA sequence
  • The 454 FLX pyrosequencer can give 100 Mb of quality data, sufficient to sequence the smaller genomes of bacteria and viruses
  • The first large sequencing done with this platform was the 580,069 bp Mycoplasma genitalium genome in 2004 (Nature 2005;437:376)
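The figures quoted above can be related to each other with a short calculation. This is only a back of the envelope sketch using the numbers in the text; the implied mean read length is a quotient of those numbers rather than a stated specification, and actual runs vary.

```python
WELLS_PER_SLIDE = 1_600_000      # from the text
PARALLEL_READS = 400_000         # from the text
BASES_PER_RUN = 100_000_000      # 100 Mb of quality data, from the text
M_GENITALIUM_BP = 580_069        # genome size, from the text

fill_fraction = PARALLEL_READS / WELLS_PER_SLIDE     # share of wells producing reads
implied_read_length = BASES_PER_RUN / PARALLEL_READS  # mean bases per read
fold_coverage = BASES_PER_RUN / M_GENITALIUM_BP       # depth on a small bacterial genome

print(fill_fraction)         # 0.25
print(implied_read_length)   # 250.0
print(round(fold_coverage))  # 172
```

One run therefore corresponds to reads from about a quarter of the wells, averaging about 250 bases each, and covers a genome the size of Mycoplasma genitalium well over 100 fold.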

Diagrams / tables

Images hosted on other servers:

Sample preparation

Sanger sequencing
  • DNA polymerase is used to extend a DNA strand until a chain elongation terminating radiolabeled or fluorescently labeled dideoxynucleotide triphosphate is incorporated
  • The DNA samples are divided into four separate sequencing reactions, each containing deoxynucleotides (dATP, dGTP, dCTP and dTTP), DNA polymerase, a single stranded DNA primer and low concentrations of labeled, chain terminating dideoxynucleotide (ddATP, ddGTP, ddCTP and ddTTP)
  • The dideoxynucleotides lack the 3' hydroxyl group on the deoxyribose sugar, blocking further phosphodiester bond formation and subsequent chain elongation
  • Low dideoxynucleotide concentrations allow the formation of DNA fragments of varying length
  • Although DNA is most commonly labeled at the 3' end, it can also be labeled at the 5' end
  • Electrophoretic / autoradiographic analysis involves sequential reading of four lanes (A, T, G and C)
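The four lane readout described above can be sketched as a toy model. It is an idealized illustration with names of our own choosing: it assumes that each ddNTP reaction yields a fragment ending at every position where that base occurs in the newly synthesized strand, and that fragments are resolved perfectly by length.

```python
def sanger_lanes(new_strand: str) -> dict[str, list[int]]:
    """Fragment lengths expected in each of the four ddNTP reactions.

    A chain terminated by ddATP ends at an A, and so on, so the lane for a
    base holds one fragment length per occurrence of that base.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_lanes(lanes: dict[str, list[int]]) -> str:
    """Read the gel from the shortest fragment to the longest across all lanes."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_lanes("GATTACA")
print(lanes)              # {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
print(read_lanes(lanes))  # GATTACA
```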

  • The shotgun sequencing and direct approaches are commonly used
  • In shotgun sequencing (random priming), genomic DNA is randomly fragmented into 2 - 3 kb fragments, inserted into a vector, and replicated via bacterial culture, leading to many different and overlapping genomic regions that can be sequenced and aligned to give the larger genomic sequence
  • Although the redundancy of this approach increases sequencing costs (because the same genomic areas may be sequenced 6 - 10 times), it has been used successfully in many applications, such as sequencing the Haemophilus influenzae genome
  • In the direct approach (primer walking), a section of DNA to be sequenced (usually ~40 kb) is placed next to a known sequence, with the primer lying within the known sequence
  • The DNA polymerase then extends the sequencing reaction into the unsequenced DNA
  • Once the sequence is obtained, a new primer is synthesized complementary to the distal portion of the newly sequenced DNA and the process is repeated until the entire genomic region is sequenced
  • The advantage of this process is that fewer genomic regions are sequenced multiple times; however, each new round of sequencing requires a new primer to be synthesized, which increases the cost
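How overlapping shotgun reads are aligned into a larger sequence can be sketched with a toy greedy merger. This is a deliberately simplified illustration, not the method used by any real assembler: it assumes error free reads from one strand, no repeats and a hypothetical 20 base "genome"; all names are ours.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


genome = "AGCTTAGGCATCCGATTGCA"  # hypothetical example sequence
shotgun_reads = [genome[10:20], genome[0:9], genome[5:14]]
print(greedy_assemble(shotgun_reads))  # ['AGCTTAGGCATCCGATTGCA']
```

Primer walking avoids this reassembly step, because each new read starts from a known position at the end of the previous one.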

  • Several different radiolabels have been employed in the Sanger method, including 32P, 35S and more recently 33P
    • 32P emits high energy β particles, giving a strong penetrating signal, resulting in diffuse autoradiographic bands, lowering the number of bands that may be read
    • 35S emits lower energy β particles, allowing more accurate and longer sequencing to be read via autoradiography
    • 33P has intermediate properties between 32P and 35S

  • Many different nonradioactive labels have been employed in the Sanger method, including fluorescent dyes 5-carboxyfluorescein, 7-nitrobenzo-2-oxa-1-diazole, tetramethylrhodamine and Texas Red
  • Nonradioactive dyes must:
    1. Have absorption and emission maxima in the visible spectrum
    2. Emit at wavelengths that allow spectroscopic separation
    3. Emit at a single wavelength
    4. Not affect the electrophoretic mobility of the DNA fragments
  • The advantage of using four different labels vs one radioisotope is that the entire sequencing reaction may be run in one reaction and analyzed in one gel lane or other single separation mechanism
  • Commonly employed DNA polymerases include DNA polymerase I Klenow fragment, AMV reverse transcriptase, Taq DNA polymerase and T7 DNA polymerase
  • Originally, Taq polymerase had a strong preference for incorporating dNTPs over ddNTPs, but this problem was solved using a Taq polymerase with a single amino acid substitution, which made the dNTP and ddNTP incorporation rates more equal

  • Advantages of chain termination method: readily adaptable to commercial kits; use far fewer toxic reagents than the Maxam-Gilbert method
  • Disadvantages of chain termination method: nonspecific primer binding, less accurate read out, formation of DNA secondary structures which alter sequencing fidelity
  • After three decades of improvement, the Sanger based sequencing method has achieved read lengths of ~1,000 bp with capillary electrophoresis and per base sequencing accuracy as high as 99.999%
  • The current cost to sequence 1 kb of DNA is about $0.50
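The accuracy figure above can be put in more familiar terms. The sketch below converts a per base accuracy into an expected number of errors per read and into a Phred style quality score (Q = -10 log10 of the error probability); the Phred scale is a widely used convention that we bring in for illustration and is not mentioned in the text.

```python
import math


def expected_errors(read_length_bp: int, accuracy: float) -> float:
    """Expected number of miscalled bases in a read of the given length."""
    return read_length_bp * (1.0 - accuracy)


def phred_quality(accuracy: float) -> float:
    """Phred style quality score for a given per base accuracy."""
    return -10.0 * math.log10(1.0 - accuracy)


print(round(expected_errors(1000, 0.99999), 4))  # 0.01
print(round(phred_quality(0.99999), 1))          # 50.0
```

At 99.999% accuracy, a 1,000 bp read is expected to contain about 0.01 errors, i.e. roughly one miscalled base per hundred such reads.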
Other innovations
  • Other innovations that improved sequencing technology were the implementation of parallel processing, automated parallel sample loading and automated sequence analysis
  • With parallel processing, multiple samples are loaded and separated via capillary electrophoresis simultaneously, allowing a vast increase in the amount of material that can be analyzed at once (i.e. high throughput analysis)
  • Current automated parallel sample loading allows many samples to be quickly and accurately loaded for capillary electrophoresis, bypassing the slow process of manually loading each sample as in slab gel electrophoresis
  • Similarly, the development of automated sequence analysis and of the computer programs required to facilitate accurate analyses has allowed faster and more efficient data interpretation
  • The development of bioinformatics has been essential in allowing the sequencing and analysis of large DNA sequences, such as the human genome
    • Although earlier sequence analysis programs often had unsatisfactory base calling algorithms which required further human analysis, later analysis programs overcame these flaws and could analyze DNA sequences with a high degree of certainty
  • The overall effect of the above innovations was to increase the read length, speed, efficiency, accuracy and throughput of DNA sequencing, while lowering sequencing costs