What happens between making a DNA library and screening by nucleic acid hybridization?

I understand how a plasmid library is made and how each clone is put on the nitrocellulose membrane and nucleic acid hybridization is done. What I don't understand is how each clone is "identified" and separately put on the membrane?

The library is 'created' during a ligation reaction: plasmid vector + insert (e.g. cDNA). That DNA is used to transform E. coli, and, after plating out the transformation mixture, single bacterial colonies are obtained. Each colony contains, to a first approximation, a single plasmid species (i.e. with a single insert). So each colony represents a clone, in the terms that you use.

I don't know if anyone really does things like this any more, but originally each plate would be transferred to nitrocellulose, for probing, and to a fresh plate for archiving. This way, any colony identified as positive after nucleic acid hybridisation could be picked from the archive plate for further characterisation.

This screening can be done on large format Petri dishes to reduce the number of plates needed. I know of one lab which in the 1980s used to screen λ libraries after plating out for plaques on cafeteria trays. That was the heroic era of molecular biology.

Digoxigenin (DIG) Labeling Methods

The DIG System is the nonradioactive technology of choice to label and detect nucleic acids for multiple applications. The system is based on a steroid isolated from digitalis plants (Digitalis purpurea and Digitalis lanata). These plants are the only natural source of digoxigenin, so the anti-DIG antibody does not bind to other biological material, ensuring specific labeling. Due to this high specificity, less material is needed than with radioactive labeling, making the DIG system ideal for nucleic acid hybridization analysis. Immobilized nucleic acids are hybridized with a DIG-labeled probe, and subsequent detection is performed using high-affinity anti-digoxigenin antibodies coupled to alkaline phosphatase (AP), horseradish peroxidase (HRP), fluorescein or rhodamine for colorimetric, chemiluminescent or fluorescent detection.

Figure 1. Example detecting DIG-labeled nucleic acids using chemiluminescence substrates.


Magnetic nanoparticles (MNPs) combine the characteristics of magnetic particles and nanoparticles (NPs): they are particles between 1 and 100 nm in size that respond to a magnetic field. As the size of MNPs decreases, the ratio of surface area to volume increases, which strengthens the surface effect, small size effect, quantum size effect and macroscopic quantum tunneling effect. MNPs also exhibit coercivity changes and a Curie temperature decrease [1, 2].

MNPs are classified into metal NPs, metal oxide NPs and alloy NPs. Metal NPs include iron, cobalt, and nickel [3]. Metal oxide NPs consist of iron oxides (γ-Fe2O3 and Fe3O4) and ferrites (CoFe2O4 and Mn0.6Zn0.4Fe2O4) [4]. Alloy NPs include FeCo and FePt [5].

Some magnetic nanoparticles exhibit superparamagnetism: in an external magnetic field they respond like paramagnets, but with a much stronger attraction, hence “super”. All MNPs are easily guided in the presence of external magnetic fields [6]. This property is often used in materials science, electrochemistry, biochemical sensing, magnetic resonance imaging (MRI), environmental and medical research [7]. MNPs also play a significant role in removing pollutants and relieving toxicity, for example in membrane separation for water treatment and purification. Researchers use MNPs to immobilize biomolecules (antibodies, proteins, enzymes, etc.) and to achieve simple, rapid, cheap and efficient separation of target biomolecules [8,9,10,11,12,13,14,15]. Biomarkers in complex clinical samples can be preconcentrated and enriched, separating them from interfering matrices and increasing the sensitivity and specificity of testing. Magnetic drug targeting delivers drugs to a target in vivo as an active targeted therapy strategy. In vitro detection with MNPs plays an important role in early rapid diagnosis and treatment of diseases, thus assisting in prevention, management, treatment and prognosis of diseases [16,17,18,19,20,21,22,23]. MNPs are also used in lateral flow test strips and in microfluidic platforms such as lab-on-a-chip (LOC) devices for continuous-flow magnetic cell separation, and can be developed into portable devices that are easy to use. MNPs have attracted great interest from researchers due to their excellent properties [24,25,26,27,28,29,30,31].

Nucleic acid is one of the most basic substances in life and has extremely important biological functions. It mainly stores and transmits genetic information [32,33,34,35,36], so analyzing it helps detect genetic changes that may be associated with certain health conditions [37,38,39,40,41]. The following is a brief review outlining the significance of nucleic acid detection in identifying important gene mutations for disease risk prediction, clinical treatment and prognosis evaluation [42,43,44,45,46,47,48]. Research shows that breast cancer susceptibility gene mutations are the main cause of breast cancer family clustering [49,50,51,52,53,54]. Berenstein [55] showed that FLT3 gene mutation is one of the most frequent gene mutations in patients with acute myeloid leukemia (AML) and can be seen in various subtypes of AML. Kayser [56] found that C-KIT mutations are present in leukemia patients. Gene sequencing has shown that lung cancer-related genes and mutations, such as epidermal growth factor receptor gene mutation, c-MET, ROS1, KRAS, and BRAF, play an extremely important role in the diagnosis and treatment of lung cancer [57,58,59,60,61,62,63,64,65,66,67,68,69,70,71]. In addition to the above diseases, many other diseases are related to gene mutations, such as type 1 diabetes mellitus [72,73,74,75], type 2 diabetes mellitus [76,77,78,79,80,81], lymphoma [82, 83], bowel cancer [84,85,86,87,88], and prostatic cancer [89,90,91]. Early nucleic acid detection is significant for identifying these gene mutations in order to prevent and treat disease and to assess prognosis [92].

To date, the gold standard for nucleic acid detection is polymerase chain reaction (PCR); however, this method is both time-consuming and laborious. PCR detection equipment is both large and expensive, and the operators need professional training. Therefore, it is especially important to develop fast and economical nucleic acid detection technology and equipment. Conventional nucleic acid extraction and isolation is a time-consuming process that goes through several centrifugation steps, often resulting in low yield and purity. These limitations restrict the application of such technologies for real-time testing, limiting PCR to more developed central cities, hospitals and medical institutions in developed countries. PCR consumables and machines are often too expensive for low-level medical service platforms, especially in the vast rural areas and medical institutions in developing countries. In addition, because of time constraints, traditional PCR-based detection methods have struggled to meet the increasing high-throughput demand in recent years for emergency treatment of sudden infectious diseases, clinical in vitro diagnosis, mobile medicine and other applications.

MNPs have high surface area to volume ratio, high binding rate with detection substances, and can perform magnetically controllable aggregation and dispersion, making preconcentration, purification and separation of nucleic acids simple and easy. MNPs have good dispersibility, which can bind biomolecules quickly and effectively. The binding is reversible, and the aggregation and dispersion of MNPs can be controlled. In the absence of an external magnetic field, the particles are nonmagnetic and are uniformly suspended in solution, while when an external magnetic field is used, the particles have magnetism and can be separated. Active substances such as bioactive adsorbents or other ligands connected to the surface of MNPs can be combined with specific biomolecules, like enzymes, DNA, proteins, and separated under the action of an external magnetic field. This method has high specificity, rapid separation and good reproducibility. MNPs have been applied in biosensors to improve the sensitivity of nucleic acid detection, and have been widely used in nucleic acid detection due to their excellent properties. A series of automatic detection instruments have been designed to utilize rapid and automatic detection of nucleic acid, which is of great significance in the medical field [93,94,95,96,97,98,99].

This article introduces the application of MNPs in nucleic acid extraction, target enrichment, infectious disease identification, site mutation detection, and library preparation for Next Generation Sequencing.

Choosing a Cloning Vector

Vectors must be relatively small molecules for convenience of manipulation. They must be capable of prolific replication in a living cell, thereby enabling the amplification of the inserted donor fragment. Another important requirement is that there be convenient restriction sites that can be used for insertion of the DNA to be cloned. Unique sites are most useful because then the insert can be targeted to one site in the vector. It is also important that there be a mechanism for easy identification and recovery of the recombinant molecule. There are numerous cloning vectors in current use, and the choice between them often depends on the size of the DNA segment that needs to be cloned. We will consider several commonly used types.


As described earlier, bacterial plasmids are small circular DNA molecules that are distinct from, as well as additional to, the main bacterial chromosome. They replicate their DNA independently of the bacterial chromosome. Many different types of plasmids have been found in bacteria. The distribution of any one plasmid within a species is generally sporadic; some cells have the plasmid, whereas others do not. In Chapter 9, we encountered the F plasmid, which confers certain types of conjugative behavior to cells of E. coli. The F plasmid can be used as a vector for carrying large donor DNA inserts, as we shall see in Chapter 12. However, the plasmids that are routinely used as vectors are those that carry genes for drug resistance. The drug-resistance genes are useful because the drug-resistant phenotype can be used to select not only for cells transformed by plasmids, but also for vectors containing recombinant DNA. Plasmids are also an efficient means of amplifying cloned DNA because there are many copies per cell, as many as several hundred for some plasmids.

Two plasmid vectors that have been extensively used in genetics are shown in Figure 10-6. These vectors are derived from natural plasmids, but both have been genetically modified for convenient use as recombinant DNA vectors. Plasmid pBR322 is simpler in structure; it has two drug-resistance genes, tet R and amp R. Both genes contain unique restriction target sites that are useful in cloning. For example, donor DNA could be inserted into the tet R gene. A successful insertion will split and inactivate the tet R gene, which then will no longer confer tetracycline resistance, and the cell will be sensitive to that drug. Therefore, the cloning procedure would be to mix the samples of cut plasmid and donor DNA, transform bacteria, and select first for ampicillin-resistant colonies, which must have been successfully transformed by a plasmid molecule. Of the Amp R colonies, only those that prove to be tetracycline-sensitive have inserts; in other words, the Amp R Tet S colonies are the ones that contain recombinant DNA. Further experiments are needed to find the clones with the specific insert required.
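
The selection logic described above can be sketched in a few lines of Python. This is only an illustration: the Colony record, its field names and the three example colonies are hypothetical, and the growth results stand in for what would be read off ampicillin and tetracycline plates.

```python
from dataclasses import dataclass


@dataclass
class Colony:
    # Hypothetical record of how one colony behaves on each antibiotic.
    name: str
    grows_on_ampicillin: bool
    grows_on_tetracycline: bool


def classify(colony: Colony) -> str:
    """Classify a colony using pBR322 insertional inactivation of tet R."""
    if not colony.grows_on_ampicillin:
        return "untransformed"           # no plasmid, so no amp R gene
    if colony.grows_on_tetracycline:
        return "plasmid without insert"  # tet R gene is still intact
    return "recombinant"                 # Amp R Tet S: tet R split by an insert


colonies = [
    Colony("c1", True, True),
    Colony("c2", True, False),
    Colony("c3", False, False),
]
results = {c.name: classify(c) for c in colonies}
print(results)
```

Only the colony that is ampicillin-resistant and tetracycline-sensitive is reported as recombinant; as the text notes, further experiments are still needed to find the clone with the specific insert.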

Figure 10-6

Two plasmids designed as vectors for DNA cloning, showing general structure and restriction sites. Insertion into pBR322 is detected by inactivation of one drug-resistance gene (tet R ), indicated by the tet S (sensitive) phenotype. Insertion into pUC18 ...

The pUC plasmid is a more advanced vector, whose structure allows direct visual selection of colonies containing vectors with donor DNA inserts. The key element is a small part of the E. coli β-galactosidase gene. Into this region has been inserted a piece of DNA called a polylinker, or multiple cloning site, which contains many unique restriction target sites useful for inserting donor fragments. The polylinker is in frame translationally with the β-galactosidase fragment and does not interfere with its translation. The transformation protocol uses recipient cells that contain a β-galactosidase gene lacking the fragment present on the plasmid. An unusual type of complementation occurs in which the partial proteins coded by the two fragments unite to form a functional β-galactosidase. A colorless substrate for β-galactosidase called X-Gal is added to the medium, and the functional enzyme converts this substrate into a blue dye, which colors the colony blue. If donor DNA is inserted into the polylinker, the enzyme fragment borne on the vector is disrupted, no complete β-galactosidase protein is formed, and the colony is white. Hence, selection for white Amp R colonies selects directly for vectors bearing inserts, and such colonies are isolated for further study.

Plasmids that contain large inserts of foreign DNA tend to spontaneously lose the insert; therefore, plasmids are not useful for cloning DNA fragments larger than 20 kb.

Phage lambda

Phage λ is a convenient cloning vector for several reasons. First, λ phage heads will selectively package a chromosome about 50 kb in length, and, as will be seen, this property can be used to select for λ molecules with inserts of donor DNA. The central part of the phage genome is not required for replication or packaging of λ DNA molecules in E. coli, so the central part can be cut out by using restriction enzymes and discarded. The two “arms” are ligated to restriction-digested donor DNA. The chimeric molecules can be either introduced into E. coli directly by transformation or packaged into phage heads in vitro. In the in vitro system, DNA and phage-head components are mixed together, and infective λ phages form spontaneously. In either method, recombinant molecules with 10- to 15-kb inserts are the ones that will be most effectively packaged into phage heads, because this size of insert substitutes for the deleted central part of the phage genome and brings the total molecule size to 50 kb. Therefore the presence of a phage plaque on the bacterial lawn automatically signals the presence of recombinant phage bearing an insert (Figure 10-7). A second useful property of a phage vector is that recombinant molecules are automatically packaged into infective phage particles, which can be conveniently stored and handled experimentally.
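
The size selection imposed by packaging can be illustrated with a small Python sketch. The 50-kb target comes from the text; the combined arm length of 37 kb and the 3-kb tolerance are illustrative assumptions chosen so that 10- to 15-kb inserts pass, not measured values for any particular λ vector.

```python
def is_packageable(insert_kb: float,
                   arms_kb: float = 37.0,      # assumed combined length of both arms
                   target_kb: float = 50.0,    # genome size that heads package (from the text)
                   tolerance_kb: float = 3.0   # assumed allowed deviation from the target
                   ) -> bool:
    """Return True if arms plus insert fall within the assumed packaging window."""
    total_kb = arms_kb + insert_kb
    return abs(total_kb - target_kb) <= tolerance_kb


for insert_kb in (0.0, 10.0, 15.0, 25.0):
    print(insert_kb, is_packageable(insert_kb))
```

Under these assumptions, arms religated without an insert are too short to be packaged and very large inserts are too long, which is why a plaque signals a recombinant molecule.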

Figure 10-7

Cloning in phage λ. A nonessential central region of the phage chromosome is discarded and the ends ligated to random 15-kb fragments of donor DNA. A linear multimer (concatenate) forms, which is then stuffed into phage heads one monomer at a ...


Cosmids are vectors that are hybrids of λ phages and plasmids, and their DNA can replicate in the cell like that of a plasmid or be packaged like that of a phage. However, cosmids can carry DNA inserts about three times as large as those carried by λ itself (as large as about 45 kb). The key is that most of the λ phage structure has been deleted, but the signal sequences that promote phage-head stuffing (cos sites) remain. This modified structure enables phage heads to be stuffed with almost all donor DNA. Cosmid DNA can be packaged into phage particles by using the in vitro system. Cloning by cosmids is illustrated in Figure 10-8.

Figure 10-8

Cloning by cosmids. The cosmid is cut at a BglII site next to the cos site. Donor genomic DNA is cut by using Sau3A, which gives sticky ends compatible with BglII. A tandem array of donor and vector DNA results from mixing. Phage is packaged in vitro ...

Single-stranded phages

Some phages contain only single-stranded DNA molecules. On infection of bacteria, the single infecting strand is converted into a double-stranded replicative form, which can be isolated and used for cloning. The advantage of using these phages as cloning vectors is that single-stranded DNA is the very substrate required for the Sanger method DNA sequencing technique currently in widespread use (page 324). Phage M13 is the one most widely used for this purpose.

Expression vectors

One way of detecting a specific cloned gene is by detecting its protein product expressed in the bacterial cell. Therefore, in these cases, it is necessary to be able to express the gene in bacteria; that is, to transcribe it and translate the mRNA into protein. Most cloning vectors do not permit expression of cloned genes, but such expression is possible if special vectors are used. However, because bacteria cannot process introns, the cloned sequences must be stripped of introns. The cloned gene is inserted next to appropriate bacterial transcription and translation start signals. Some expression vectors have been designed with restriction sites located just next to a lac regulatory region. These restriction sites permit foreign DNA to be spliced into the vector for expression under the control of the lac regulatory system.

[04:51] Question #16:

In prokaryotes, genes can exist as operons that are transcribed into a polycistronic mRNA containing multiple genes in a single transcript. In eukaryotes, transcripts exist only as monocistronic mRNA containing a single gene. What fundamental genetic difference is responsible for this distinction?

  • (A) mRNA is transported outside of the nucleus in eukaryotes.
  • (B) Prokaryotic mRNA has a five-prime GTP cap.
  • (C) Prokaryotes use a single start codon for multiple genes.
  • (D) In eukaryotes, each gene has its own transcription initiation site.

[05:47] Bryan’s Insights:

Answer choice (D) is the right answer. In eukaryotes, each gene has its own transcription initiation site, and so one mRNA is transcribed off of it for one gene.

A very common trap answer is choice (A), “mRNA is transported outside the nucleus in eukaryotes.” This is true—you move outside the nucleus to make a protein. Also, prokaryotes don’t have a nucleus, so that is a distinction between the two.

But just like we saw in Question #15, this is an example where the trap answer, although true, is not the answer to the question. The question was specifically getting at the “difference between mRNAs that are polycistronic versus monocistronic.” That is, how could you have multiple genes in one mRNA as opposed to a single gene in mRNA? And that has nothing to do with whether the mRNA is in the nucleus or not.

Again, a key lesson on the MCAT: Don’t pick an answer choice just because it’s true, pick it because it actually answers the question.

Moreover, when you’re reviewing your practice tests and you see a mistake like this, don’t brush it off. If you have that “it was a stupid mistake” kind of reaction, always dig in a little more. Find out why it was a stupid mistake, so you don’t do it again.

Author information


School of Chemistry, University of New South Wales Sydney, Sydney, New South Wales, Australia

Roya Tavallaie, Richard David Tilley, David Brynn Hibbert & John Justin Gooding

Australian Centre for NanoMedicine, University of New South Wales Sydney, Sydney, New South Wales, Australia

Roya Tavallaie, Joshua McCarroll, Richard David Tilley, Maria Kavallaris & John Justin Gooding

ARC Centre of Excellence in Convergent Bio-Nano Science and Technology, University of New South Wales Sydney, Sydney, New South Wales, Australia

Roya Tavallaie, Maria Kavallaris & John Justin Gooding

Tumour Biology and Targeting Program, Lowy Cancer Research Centre, Children’s Cancer Institute, University of New South Wales Sydney, Sydney, New South Wales, Australia

Joshua McCarroll, Marion Le Grand & Maria Kavallaris

School of Women’s and Children’s Health, Faculty of Medicine, UNSW Sydney, Sydney, New South Wales, Australia

Joshua McCarroll & Maria Kavallaris

Electron Microscope Unit, Mark Wainwright Analytical Centre, University of New South Wales Sydney, Sydney, New South Wales, Australia

Nicholas Ariotti & Richard David Tilley

Analytical Chemistry – Center for Electrochemical Sciences (CES), Ruhr-Universität Bochum, Bochum, Germany

Department of Inorganic and Analytical Chemistry, University of Geneva, Geneva, Switzerland

Materials and Methods

ILPs were designed according to the recognition sites of the restriction endonuclease tested previously on the target genomic DNA. (Each ILP should have a Tm value between 40°C and 50°C, and its secondary structure and self-hybridization score should be below 50. For LNA Tm and spiked oligo hybridization tests, see for protocols.) Usually, two or more ILPs could be used to generate products with different lengths. Probes for the suspension array typing assay were designed using a web-based software. The ILPs and probes were synthesized and HPLC purified by IBA BioTAGnology (IBA GmbH, Göttingen, Germany). All probes were synthesized with an amino- and a carbon-linker modification at the 5′ terminus. Each probe was covalently coupled to carboxylated microspheres (QIAGEN) according to the manufacturer's protocol. All primers used for testing the biased priming of ILPs were designed with Lasergene v6.1 software (DNAStar, Inc., Madison, WI, USA). They were designed against the conserved regions of bacterial genomes where no sequences of the applied ILPs could be found. The primers were synthesized by TaKaRa (TaKaRa, Dalian, China).

Sample DNAs from bacterial cultures (E. coli, K. oxytoca, and K. pneumoniae) were extracted and purified using the DNeasy Blood & Tissue kit (QIAGEN). DNA concentrations were quantified at OD260nm/280nm on a BioPhotometer Plus (Hamburg, Germany) (1 ng, 10 ng and 100 ng) to circumvent any source of error.

Genomic DNA mixtures were amplified using a 7500 Real-time PCR system (Applied Biosystems) followed by suspension array-based typing. Each 25 μL reaction contained 1×AmpliTaq Stoffel fragment polymerase and buffer (Applied Biosystems), 3 mM of MgCl2, 200 nM (each) of dATP, dCTP, dGTP (TaKaRa), 150 nM of Alexa Fluor 532-labeled dUTP (Molecular Probes), and 2.5 μM of each ILP. For monitoring of the ILP amplification kinetics, 1×SYBR Green I dye (Molecular Probes) and 1×ROX (Molecular Probes) were added, and 150 nM of dUTP purchased from TaKaRa was used. The cycling conditions were as follows: 5 min at 95°C; 20 cycles of 1 min at 95°C, 1 min at 40°C, 1 min at 50°C, and 1 min at 72°C in 9600 emulation mode; then hold at 4°C.

The 20 μL reaction for the real-time PCR for the ILP priming bias test contained 1×SYBR-Taq mixture (Applied Biosystems), 0.2 μM of each primer, and 1 μL of each DNA template (genomic DNA or ILP products). The thermal cycling conditions were as follows: 95°C for 3 min; 40 cycles of 95°C for 15 sec, 58°C for 10 sec, and 72°C for 40 sec; the fluorescence signals were read at the end of each cycle at 72°C. Dissociation test (Melting Curve) and agarose electrophoresis analysis were carried out after the 40 cycles were complete.

After a total of 20 cycles of ILP-PCR with the labeled dUTP, the products were mixed with 3,000 beads of each probe in a final volume of 50 μL. The mixture was incubated at 95°C for 5 min, followed by incubation at 50°C for 20 min. The mixture was then transferred to a 96-well filter plate. The beads were washed once and resuspended in SSC-Tween buffer. The fluorescent signals were detected according to the suspension array manufacturer's protocol.

Nucleic acid hybridization

Most techniques of eukaryotic gene analysis are based on nucleic acid hybridization. This technique involves annealing single-stranded pieces of RNA and DNA to allow complementary strands to form double-stranded hybrids. For example, if DNA is cut into small pieces and each piece dissociated into two single strands and denatured, each strand in the solution should find and reunite with its complementary partner, given sufficient time. The conditions of renaturation must be such that specific binding between complementary strands is maintained while nonspecific matchings are dissociated. This is usually achieved by varying the temperature or the ionic conditions in the solution in which renaturation is taking place (Wetmur and Davidson, 1968). Similarly, RNA synthesized from a particular region of DNA would be expected to bind to the strand from which it was transcribed (Figure 1). Thus, RNA is expected to hybridize specifically with a gene that encodes it. To measure this hybridization, one of the nucleic acid strands (the probe) is usually labeled by the incorporation of radioactive nucleotides. One technical problem that originally plagued nucleic acid hybridization studies was the difficulty of getting enough radioactivity into the RNA molecule. This problem is circumvented by isolating the RNA and making a complementary DNA (cDNA) copy in the presence of radioactive precursors. This can be done in a test tube containing the RNA, a short stretch of DNA (called a primer), radioactive DNA precursors, and the viral enzyme reverse transcriptase. This enzyme is capable of making DNA from an RNA template (Figure 2). Because the DNA is synthesized in vitro, one need not worry about the dilution of the radioactive precursors. Furthermore, the cDNA can hybridize with both the gene that produced the RNA (albeit the other strand) and the RNA itself, making it extremely useful in detecting small amounts of specific RNAs.
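
The base-pairing rule behind probe hybridization can be made concrete with a short Python sketch. It assumes DNA sequences written 5′ to 3′ with only the letters A, C, G and T, and it looks for perfect matches only; the temperature- and salt-dependent stringency described above is not modeled. The function names and example sequences are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the strand that would pair with seq, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def probe_binding_sites(target_strand: str, probe: str) -> list[int]:
    """Return 0-based positions on target_strand where the probe pairs perfectly."""
    site = reverse_complement(probe)  # sequence the probe is complementary to
    target = target_strand.upper()
    hits = []
    start = target.find(site)
    while start != -1:
        hits.append(start)
        start = target.find(site, start + 1)
    return hits


target = "TTGACCTGAAGCTTACG"   # hypothetical single strand fixed to a filter
probe = "AAGCTTCAG"            # hypothetical labeled probe
print(probe_binding_sites(target, probe))
```

A probe with no complementary stretch on the target returns an empty list, which corresponds to a colony that stays unlabeled after washing.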

Figure 1 Nucleic acid hybridization. (A) If the DNA helix is separated into two strands, the strands should reanneal, given the appropriate ionic conditions and time. (B) Similarly, if DNA is separated into its two strands, RNA should be able to bind to the genes that encode it. If present in sufficiently large amounts compared with the DNA, the RNA will replace one of the DNA strands in this region.

Figure 2 Method for preparing complementary DNA (cDNA). Most mRNA contains a long stretch of adenosine residues (AAAn) at the 3′ end of the message; therefore, investigators anneal a primer consisting of 15 deoxythymidine residues (dT15) to the 3′ end of the message. Reverse transcriptase then transcribes a complementary DNA strand, starting at the dT15 primer. The cDNA can be isolated by raising the pH of the solution, thereby denaturing the double-stranded hybrid and cleaving the RNA.

Cloning from genomic DNA

As early as 1904, Theodor Boveri despaired that the techniques of his time might never allow him to study how genes create embryos. A particular type of gene amplification technique was needed:

For it is not cell nuclei, not even individual chromosomes, but certain parts of certain chromosomes from certain cells that must be isolated and collected in enormous quantities for analysis; that would be the precondition for placing the chemist in such a position as would allow him to analyse [the hereditary material] more minutely than the morphologists.

However, since the 1970s, nucleic acid hybridization has enabled developmental biologists to do just what Boveri wanted: to isolate and amplify specific regions of the chromosome. The main technique for isolating and amplifying individual genes is called gene cloning. The first step in this process involves cutting nuclear DNA into discrete pieces by incubating the DNA with a restriction endonuclease (more commonly called a restriction enzyme). These endonucleases are usually bacterial enzymes that recognize specific sequences of DNA and cleave the DNA at these sites (Nathans and Smith 1975). For example, when human DNA is incubated with the enzyme BamHI (from Bacillus amyloliquefaciens strain H), the DNA is cleaved at every site where the sequence GGATCC occurs. The products are variously sized pieces of DNA, all ending with G on one end and GATCC on the other (Figure 3). These pieces are often called restriction fragments.
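
The BamHI digestion described above can be sketched in Python. The sketch assumes a linear molecule given as one strand written 5′ to 3′, and it cuts after the first G of each GGATCC site so that fragments end in G and begin with GATCC, as stated in the text. The function name and the example sequence are invented for illustration.

```python
def digest(seq: str, site: str = "GGATCC", cut_offset: int = 1) -> list[str]:
    """Cut a linear sequence at every occurrence of site.

    cut_offset is the number of bases of the site left on the upstream
    fragment (1 for BamHI, which cuts between the two G residues).
    """
    seq = seq.upper()
    fragments = []
    previous_cut = 0
    position = seq.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(seq[previous_cut:cut])
        previous_cut = cut
        position = seq.find(site, position + 1)
    fragments.append(seq[previous_cut:])  # piece after the last site
    return fragments


print(digest("AAAGGATCCTTTTGGATCCAA"))
```

Because GGATCC reads the same on both strands, searching one strand is enough to locate every site; only the internal fragment carries a G at one end and GATCC at the other, since the outer pieces keep the original ends of the linear molecule.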

Figure 3 The general protocol for cloning DNA, using as an example the insertion of a human DNA sequence into a plasmid with one BamHI-sensitive site.

The next step in gene cloning is to incorporate these restriction fragments into cloning vectors. These vectors are usually circular DNA molecules that replicate in bacterial cells independently of the bacterial chromosome. Either drug-resistant plasmids or specially modified viruses (which are especially useful for cloning large DNA fragments) are used. For instance, a vector can be constructed to have only one BamHI-sensitive site. This vector can be opened by incubating it with that restriction enzyme. After being opened, it can be mixed with the BamHI-fragmented human DNA. In numerous cases, the cut DNA pieces will become incorporated into these vectors (because their ends are complementary to the vector's open ends), and the pieces can be joined covalently by placing them in a solution containing the enzyme DNA ligase. The whole process yields bacterial plasmids that each contain a single piece of human DNA. These are called recombinant plasmids or, usually, recombinant DNA (Cohen et al., 1973; Blattner et al., 1978). The plasmid illustrated in Figure 3 is pUC18, a cloning vector often used by molecular biologists (Vieira and Messing, 1982). It contains (1) a drug-resistance gene, Ap R , which makes the bacterium resistant to ampicillin and allows researchers to select for those bacteria that have incorporated a plasmid; (2) an origin of DNA replication that enables the plasmid to replicate hundreds of times in each bacterium; and (3) a polylinker, a short, artificial stretch of DNA that contains the recognition sites for several restriction endonucleases. The polylinker resides within a lacZ gene that encodes E. coli β-galactosidase. The polylinker is short enough (and has the correct number of base pairs) so that it does not interfere with the enzymatic activity of the β-galactosidase. The cloning procedure begins when the restriction fragments of the nuclear DNA are mixed with the opened pUC18 plasmids and are then ligated shut. The putative recombinant plasmids made in this manner are then incubated with ampicillin-sensitive E. coli cells that lack a β-galactosidase gene. Even though the bacteria and the plasmids are mixed together under conditions that encourage the bacteria to take in plasmids, not every bacterium incorporates a plasmid. To screen for those bacteria that have incorporated plasmids, the treated E. coli cells are grown on agar containing ampicillin. Only those bacteria that have incorporated a plasmid (with its dominant ampicillin-resistance gene) survive.

But not every plasmid has incorporated a foreign gene, because it is possible for the "sticky ends" of the restriction enzyme site to renature with themselves. To distinguish bacterial colonies that have incorporated foreign DNA from those that have not, the agar also contains a dye called X-gal. This compound is colorless, but when acted upon by β-galactosidase, it forms a blue precipitate. [i] Thus, if a plasmid has not incorporated a restriction fragment into its restriction enzyme site in the polylinker, the β-galactosidase (lacZ) gene is functional, and the resulting β-galactosidase turns the dye blue. The result is the appearance of "blue colonies." However, if the plasmid has taken up a DNA fragment, the β-galactosidase gene is destroyed by the insertion. These bacteria will not turn the dye blue; they produce colorless colonies on the agar. Colorless colonies are then screened for the presence of the particular gene. Cells from each of these colonies are placed on a paper-thin nitrocellulose or nylon filter. When these cells are lysed, their DNA gets stuck on the filter. Next, the DNA strands are separated by heating, and the filter is incubated in a solution containing the radioactive RNA (or its cDNA copy) of the gene one wishes to clone. (In some cases, the sequence of the mRNA or gene is not yet known, and one has to guess the sequence from the amino acid sequence of the protein.) If a plasmid contains that gene, its DNA should be on the filter, and only that DNA should be able to bind the radioactive RNA or cDNA probe. Therefore, only those areas will be radioactive. The radioactivity of these regions is detected by autoradiography. Sensitive X-ray film is placed over the treated paper. The high-energy electrons emitted by the radioactive RNA sensitize the silver grains in the film, causing them to turn dark when the film is developed. Eventually, a black spot is produced over each colony containing the recombinant plasmid carrying that particular gene (see Figure 3). This colony is then isolated and grown, producing billions of bacteria, each containing hundreds of identical recombinant plasmids.

The recombinant plasmids can be separated from the E. coli chromosome by centrifugation, and incubating the plasmid DNA with BamHI releases the foreign DNA fragment that contains the gene. This fragment can then be separated from the plasmid DNA, so the investigator has micrograms of purified DNA sequences containing a specific gene. Although this procedure sounds very logical and easy, the number of colonies that must be screened is often astronomical. The number of random fragments that must be cloned to obtain the gene we want gets larger with the increasing complexity of the organism's genome. [ii] To detect a particular gene from a mammalian genome, millions of individual clones must be screened.
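How quickly the screening burden grows with genome size can be estimated with the standard Clarke and Carbon library-size formula, N = ln(1 − P) / ln(1 − f), where f is the fraction of the genome carried by one insert and P is the desired probability of recovering a given sequence. This formula is not stated in the text above, and the sketch assumes that fragments are random and equally clonable. The insert and genome sizes in the usage note are illustrative round numbers.

```python
import math


def clones_needed(insert_bp: float, genome_bp: float, probability: float = 0.99) -> int:
    """Clarke-Carbon estimate of how many independent clones must be screened
    so that a given sequence is present with the requested probability.

    Assumes random fragmentation and unbiased cloning (an idealization).
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    fraction = insert_bp / genome_bp  # share of the genome in one clone
    if not 0.0 < fraction < 1.0:
        raise ValueError("insert must be smaller than the genome")
    # log1p(-x) computes ln(1 - x) accurately for very small x
    return math.ceil(math.log(1.0 - probability) / math.log1p(-fraction))


if __name__ == "__main__":
    # Illustrative values: 4 kb inserts, ~3e9 bp mammalian genome, ~4.6e6 bp E. coli genome
    print(clones_needed(4_000, 3e9))
    print(clones_needed(4_000, 4.6e6))
```

With 4 kb inserts, this estimate comes out at a few thousand clones for a bacterial genome and a few million for a mammalian one, which agrees with the "millions of individual clones" figure above.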

DNA hybridization: Within and across species

Clones can be screened by any radioactive stretch of nucleotides. Therefore, the genes cloned from one organism can be probed with radioactive cDNAs derived from the mRNAs of another species. One of the most exciting findings of modern developmental biology has been that genes used for specific developmental processes in one organism may be used for similar processes in other organisms. Drosophila has been critical in the discovery of these genes. Starting with Morgan, these genes have been mapped, and in the 1960s, E. B. Lewis confirmed that some of these genes are responsible for the formation of basic body parts. One of these, Antennapedia, is a gene whose protein product is essential for inhibiting head structures from forming in the thorax. If the gene is missing, antennae grow where the legs should be. If the gene is expressed in the head (as it is in a particular mutant), the fly develops an extra set of legs coming out of its eye sockets. Could such a gene exist in vertebrates?

Evidence for such genes in vertebrates came first from DNA blots, sometimes called Southern blots after their inventor, E. M. Southern (1975). DNA from numerous vertebrate and invertebrate organisms was treated with a restriction enzyme, and the resulting DNA fragments were separated on an electrophoresis gel. The mixtures of fragments were placed into slots on one side of a gel, and an electric current was passed through the gel. The negatively charged DNA fragments migrated toward the positive pole, the smaller fragments moving faster than the larger ones. [iii] However, hybridization cannot be done inside a gel; the DNA must be transferred to a flat surface, and this is done by blotting. After denaturing the DNA strands in alkali, investigators returned the gel to a neutral pH, then placed it on wet filter paper atop a plastic support (McGinnis et al., 1984; Holland and Hogan, 1986). Nitrocellulose paper (capable of binding single-stranded DNA) was placed directly over the gel and covered with multiple layers of dry paper towels. The filter paper beneath the gel extended into a trough of high-ionic-strength buffer. The buffer traveled through the gel up through the nitrocellulose filter and into the towels. The DNA was brought up through the gel by this flow of buffer, but it was stopped by the nitrocellulose filter; thus, the DNA was transferred from the gel to the nitrocellulose paper. After baking the DNA fragments onto the nitrocellulose paper (otherwise they would have come off), the DNA fragments were incubated with radioactive cDNA from a portion of the Drosophila Antennapedia gene. An autoradiogram of the nitrocellulose paper showed where the radioactive DNA had found its match. The results from these experiments (Figure 4) showed that even vertebrates (mice, humans, and chicks) have genes that hybridize to these sequences. This radioactive section of the Antennapedia gene was then used to screen a genomic library of DNA clones derived from the genome of these different species. Investigators found clones containing genes that resemble Antennapedia, and these genes were revealed to be extremely important in the formation of the vertebrate body axis.

Figure 4 Southern blots of various organisms' DNA using a radioactive probe from the Antennapedia gene of Drosophila melanogaster. Because we do not expect the sequences between such diverged species to be perfectly identical, the stringency of the hybridization is lowered by changing the salt conditions. (Such low-stringency blots across phyla are colloquially referred to as "zoo blots," for obvious reasons.) Autoradiography shows that Drosophila genes contain several portions that are like Antennapedia genes in structure and that many organisms contain several genes that will hybridize this radioactive gene fragment, suggesting that Antennapedia-like genes exist in these organisms. The numbers beside the blots indicate size of bands, in kilobases. (From McGinnis et al., 1984, courtesy of W. McGinnis.)

DNA sequencing

Sequence data can tell us the structure of the encoded protein and can identify regulatory DNA sequences that certain genes have in common. The simplicity of the Sanger "dideoxy" sequencing technique (Sanger et al., 1977) has made it a standard procedure in many molecular biology laboratories. One starts with the vector carrying the cloned gene and isolates a single strand of the circular DNA. One then anneals a radioactive primer of DNA (about 20 base pairs) complementary to the vector DNA immediately 3′ to the cloned gene. (Because these vector sequences are known, oligonucleotide primers can be readily synthesized or purchased commercially.) The primer has a free 3′ end to which more nucleotides can be added. One places the primed DNA and all four deoxyribonucleoside triphosphates into four test tubes. Each of the test tubes contains the polymerizing subunit of DNA polymerase and a different dideoxynucleoside triphosphate: one tube contains dideoxy-G, one tube contains dideoxy-A, and so forth. The structures of the deoxynucleotides and the dideoxynucleotides are shown in Figure 5. Whereas a deoxyribonucleotide has no hydroxyl (OH) group on the 2′ carbon of its sugar, a dideoxyribonucleotide lacks hydroxyl groups on both the 2′ and 3′ carbons. So even though a dideoxyribonucleotide can be bound to a growing chain of DNA by DNA polymerase, it stops the chain's growth because, lacking a 3′ hydroxyl group, no new nucleotide can bind to it. Thus, when the DNA polymerase is synthesizing DNA from the primer, the new DNA will be complementary to the cloned gene. In the tube with dideoxy-A, however, every time the polymerase puts an A into the growing chain, there is a chance that the dideoxy-A will be placed there instead of the deoxy-A. If this happens, the chain stops. Similarly, in the tube with dideoxy-G, the chain has the potential to stop every time a G is inserted. (The process has been likened to a Greek folk dance in which some small percentage of the potential dancers have one arm in a sling.) Because there are millions of chains being made in each tube, each tube will contain a population of chains, some stopped at the first possible site, some at the last, and some at sites in between. The tube with dideoxy-A, for instance, will contain chains of different discrete lengths, each ending at an A residue. The resulting radioactive DNA fragments are separated by electrophoresis. The result is a "ladder" of fragments wherein each "rung" is a nucleotide sequence of a different length. By reading up the ladders, one obtains the DNA sequence complementary to that of the cloned gene.
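The logic of reading a dideoxy ladder can be illustrated with a toy simulation. It is a sketch of the idea only: it assumes that termination can occur at every possible position in each tube, counts fragment lengths from the end of the primer, and uses an invented eight-base template. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesized_strand(template_3_to_5: str) -> str:
    """New strand (written 5'->3') made by a polymerase that reads the
    template in the 3'->5' direction, starting just past the primer."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def termination_lengths(new_strand: str) -> dict[str, list[int]]:
    """For each dideoxy tube (A, C, G, T), list the lengths of every chain
    that could be terminated there: one length per occurrence of that base."""
    tubes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes


def read_ladder(tubes: dict[str, list[int]]) -> str:
    """Read the gel from the shortest fragment to the longest; the lane in
    which each rung appears gives the base at that position."""
    rungs = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in rungs)


if __name__ == "__main__":
    template = "TACGGATC"  # invented template, written 3'->5'
    tubes = termination_lengths(synthesized_strand(template))
    print(tubes)
    print(read_ladder(tubes))  # sequence complementary to the template
```

Each tube contributes only the rungs that end in its own base, so sorting all rungs by length across the four lanes reconstructs the newly synthesized strand.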

Figure 5 Comparison of deoxynucleotides and dideoxynucleotides. (A) Structures of the two types of nucleotides. The difference is highlighted. (B) The 3′ end of a chain that has been terminated by incorporation of dideoxynucleotide because it has no free 3′ hydroxyl group for further DNA polymerization.

Analyzing mRNA through cDNA libraries

Now we can return to the specificity of mRNA transcription: Can we isolate populations of mRNA that characterize certain cell types and are absent in all others? To find these RNAs, we can "clone" the mRNA from different types of cells and compare them. As shown in Figure 6A, this is done by taking the messenger RNAs from a cell or tissue and converting them into complementary DNA strands. By taking the procedure a step further (with the aid of DNA polymerase and S1 nuclease), we can change this population of single-stranded cDNA into a population of double-stranded cDNA pieces. These strands of DNA can be inserted into plasmids by adding the appropriate "ends" onto them with DNA ligase. Appending a GATCC/G fragment onto the blunt ends of such a DNA piece creates an artificial BamHI restriction cut and enables the piece to be inserted into a virus or plasmid cut with that enzyme (Figure 6B).

Such collections of clones derived from mRNAs are often called libraries. Thus, we can have a 16-day embryonic mouse liver library, representing all the genes active in making embryonic liver proteins. We can also have a Xenopus vegetal oocyte library, representing messages present only in a particular part of that cell. Genes cloned in this manner are very important because they lack introns. When added to bacterial cells, these genes can be transcribed and then translated into the proteins they encode.

Libraries have been extremely useful in studying development, as seen in the efforts of Wessel and co-workers (1989) to detect differences in the RNAs in different parts of the gastrulating sea urchin embryo. To find endoderm-specific mRNAs in sea urchins, Wessel and co-workers prepared a cDNA library from gastrulating embryos. The mRNA of these samples (most of the RNA of eukaryotic cells is ribosomal) was isolated by running the samples through oligo-dT beads that capture the poly(A) tails of the messages (see legend to Figure 3). Then the mRNA population was converted into a cDNA population by using reverse transcriptase (see Figure 6A). By using E. coli polymerase I, the single-stranded cDNA was then made double-stranded. Next, commercially available EcoRI "ends" were ligated onto the double-stranded cDNAs. This made them clonable into vectors that were cut with EcoRI restriction enzyme. The DNA was then mixed with the arms of a genetically modified λ phage (see Figure 6B). This phage is so constructed that when grown in a petri dish, the phages that have incorporated the DNA (and thus destroyed the β-galactosidase gene) produce colorless plaques (Figure 6C). In this way, approximately 4 million recombinant phages were generated, each containing a cDNA representing an mRNA molecule.

The next steps involved screening the recombinant phages. Which ones might represent mRNAs found in endoderm and not in the other cell layers? Wessel and his colleagues isolated the mRNA populations from mesoderm, ectoderm, and endoderm. They then made labeled cDNAs from each of the mRNA populations using radioactive precursors. They now had three collections of radioactive cDNA molecules, each representing the mRNA population from one of the three germ layers.

The recombinant phages representing the mRNAs of the gastrulating sea urchin embryo were grown, and samples of numerous colonies, each containing thousands of phages, were placed on two nitrocellulose filters (Figure 6D). These samples were then placed in alkaline solutions to lyse the phages and make the DNA single-stranded. One of these filter papers was incubated with radioactive cDNA made from the total mRNA of the endoderm; the other paper was incubated with radioactive probes to both mesoderm and ectoderm. The filters were then washed to remove any unhybridized radioactive cDNA, dried, and exposed on X-ray film. If an mRNA were present in the endoderm but not in either the ectoderm or mesoderm, the recombinant DNA made from that message should bind radioactive cDNA from the endoderm but should not find an mRNA anywhere else. As a result, that spot of recombinant DNA from the endoderm should be radioactive (since it bound radioactive cDNA from the endoderm), but the same clone should not be radioactive when exposed to ectodermal or mesodermal mRNA. This was found to be the case. One recombinant phage in particular only bound radioactive cDNA made from endodermal mRNA; hence, it represented an mRNA found in the endoderm and not in the mesoderm or ectoderm. The phage containing this gene can now be grown in large quantities and characterized.
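Comparing the two filters amounts to a set difference: the clones of interest give a signal with the endoderm probe but not with the combined mesoderm and ectoderm probe. A minimal sketch follows, with invented clone labels standing in for spot positions on the filters.

```python
def tissue_specific_clones(
    positive_on_target: set[str], positive_on_other: set[str]
) -> set[str]:
    """Clones that hybridize to the target-tissue probe but not to the
    probe made from the other tissues (differential screening)."""
    return positive_on_target - positive_on_other


if __name__ == "__main__":
    # Hypothetical spot labels; real data would be positions on the autoradiograms
    endoderm_filter = {"clone_A1", "clone_B4", "clone_C7"}
    meso_ecto_filter = {"clone_A1", "clone_C7", "clone_D2"}
    print(tissue_specific_clones(endoderm_filter, meso_ecto_filter))
```

Clones that light up on both filters represent messages shared between germ layers and are set aside.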

Figure 6 Protocol used to make cDNA libraries. (A) Messenger RNA is isolated and made into cDNA. This cDNA is made double-stranded, and restriction fragment ends are added. (B) The cDNA "genes" can then be inserted into specially modified vectors, in this case bacteriophages. (C) Phages containing the recombinant DNA will lyse E. coli, forming plaques. Biochemical techniques can distinguish plaques of recombinant phages from those that lack the inserted gene. (D) The plaques are transferred to nitrocellulose paper and treated with alkali to lyse the phages and denature the DNA in place. These filters are then incubated in radioactive probes (usually cDNA) from a tissue. For the differential cDNA library screening discussed in the text, the same phage library was screened with radioactive probes from two different tissues, allowing the researchers to look for an mRNA that would be found in one type of tissue but not in the other.

Materials and Methods


The 3-E7 antibody was purchased from Gramsch Laboratories (Schwabhausen, Germany). PANSORBIN cells were purchased from Calbiochem (San Diego, California, United States). BSA was purchased from New England Biolabs (Beverly, Massachusetts, United States). Yeast tRNA was purchased from Ambion (#7119, Austin, Texas, United States). The [Leu]enkephalin peptide and all oligonucleotides were purchased from the Stanford PAN Facility (Stanford, California, United States).


Solid-phase peptide synthesis was carried out as previously described (Halpin et al. 2004). 5′ amino-modified ssDNA (#10-1912-90, #10-1905-90, #10-1918-90, Glen Research, Sterling, Virginia, United States) was noncovalently bound to DEAE Sepharose Fast Flow resin (#17-0709-01, Pharmacia-LKB Technology, Uppsala, Sweden) packed into TWIST column housings (Glen Research #20-0030-00). DNA was loaded onto the columns in 10 mM acetic acid, 0.005% Triton X-100 buffer. To accomplish amino acid additions, columns were washed with 3 ml of DMF and subsequently incubated with 62.5 mg/ml Fmoc succinimidyl esters in 300 μl of coupling solvent (22.5% water, 2.5% DIEA, and 75% DMF) for 5 min. Excess reagent was washed away with 3 ml DMF, and the coupling procedure was repeated. The Fmoc-protecting group was then removed by two 1-ml treatments with 20% piperidine in DMF, one for 3 min and one for 17 min (Carpino and Han 1970). Finally, the columns were washed with 3 ml of DMF followed by 3 ml of DEAE Bind Buffer (10 mM acetic acid, 0.005% Triton X-100). Anhydride couplings followed the same procedure except that a 3-ml water wash was added after DNA loading to remove remaining acetic acid. Columns were incubated with 10 mM of each anhydride (100 mM for trimethylacetic anhydride) in 500 μl of DMF for 30 min. 20-base oligonucleotide–peptide conjugates were eluted off DEAE columns with 2 ml of DEAE Elute Buffer (1.5 M NaCl, 50 mM Tris pH 8.0, and 0.005% Triton X-100). 340-base ssDNA-peptide conjugates were eluted with 2 ml of Basic Elute Buffer (1.5 M NaCl, 10 mM NaOH, and 0.005% Triton X-100) heated to 80 °C. For synthesis of libraries, a 2-ml PBS wash was added at the end of each amino acid coupling step to remove remaining anionic reagents. Following the last coupling step in the library synthesis, free oligonucleotides were separated from 340-base DNA supports by washing with 2 ml of DEAE Elute Buffer.

Electromobility shift assay.

The electromobility shift assay was performed as previously described (Hwang et al. 1999). No plasmid DNA was added to the samples. Antibody 3-E7 (0.5 μg) was added to the “antibody plus” samples. Samples were run on a 2% NuSieve (#50081, FMC Bioproducts, Rockland, Maryland, United States) agarose gel for 1 h at 100 V in TBE. 840 μM peptide was used to compete away binding to the peptide–DNA conjugate.


ssDNA was converted to double-stranded DNA (dsDNA) by one-cycle PCR with a single end primer. The 50-μl PCR reaction contained 20 μM primer, 200 μM of each dNTP, 5 mM MgCl2, 1X Promega Taq reaction buffer, and 5 U of Taq DNA polymerase (#M1661, Promega, Madison, Wisconsin, United States). The PCR program was 94 °C for 2.5 min, 58 °C for 1 min, and 72 °C for 15 min. The dsDNA–peptide conjugates were incubated with PANSORBIN cells in 50 μl of Selection Buffer (TBS, 0.1% BSA, and 0.1 μg/μl yeast tRNA) at 4 °C for 1 h to preclear conjugates that nonspecifically bind to the cells. Then, preclear beads were pelleted by centrifugation and removed. Antibody 3-E7 (0.5 μg) was added to the supernatant and allowed to incubate for 1 h at 4 °C. The solution was then mixed with fresh PANSORBIN cells for 1 h at 4 °C. The cells were pelleted and washed at 25 °C three times with 500 μl of Wash Buffer (TBS, 0.1% BSA, 0.1 μg/μl tRNA, and 350 mM NaCl), followed by a single wash with 500 μl of Selection Buffer. The dsDNA–peptide conjugates were eluted by incubation of the cells with 50 μl of 200 μM [Leu]enkephalin in Selection Buffer for 1 h at 25 °C. Selected genes were amplified from 10 μl of elute supernatant with 25-cycle PCR reactions.


High performance liquid chromatography analysis of DNA–peptide conjugates, synthesis of anticodon columns, hybridization and transfer of DNA, library assembly, ssDNA generation, and library isolation were performed as previously described (Halpin and Harbury 2004; Halpin et al. 2004).


DNA templates for XACTLY experiments

Synthetic control oligos

Synthetic control oligos (Supplementary Table S1) were designed using a random sequence generator at 50% GC content. Sequences matching any known organism in public databases were removed. Each control molecule (n = 12) is a unique 50 bp sequence of double-stranded DNA with one blunt end and one 3′ or 5′ single-stranded overhang of random sequence, 1–6 nucleotides in length. Because each control is a unique sequence, it serves as its own barcode indicating the structure of the oligo. Oligos were synthesized using standard desalting purification and duplexed by Integrated DNA Technologies (IDT); all random nucleotides were ‘hand-mixed’ to reduce synthesis bias. Control oligos were pooled together in an equimolar ratio.
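A random sequence generator of the kind described can be sketched as follows. This is not the generator used in the study. It assumes that "50% GC content" means exactly half of the bases in every sequence are G or C, and it omits the screen against public databases. The function names and the seed are hypothetical.

```python
import random


def random_sequence(length: int, gc_fraction: float, rng: random.Random) -> str:
    """Random DNA sequence with a fixed number of G/C bases (assumption:
    the GC fraction is enforced exactly rather than on average)."""
    n_gc = round(length * gc_fraction)
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(length - n_gc)]
    rng.shuffle(bases)  # scatter the G/C positions along the sequence
    return "".join(bases)


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def design_controls(n: int = 12, length: int = 50, seed: int = 0) -> list[str]:
    """Draw n distinct candidate control sequences; a real design would
    also discard candidates that match known genomes."""
    rng = random.Random(seed)
    candidates: set[str] = set()
    while len(candidates) < n:
        candidates.add(random_sequence(length, 0.5, rng))
    return sorted(candidates)


if __name__ == "__main__":
    for seq in design_controls():
        print(seq, gc_content(seq))
```

Because each candidate is distinct, the sequence itself can act as the barcode that identifies which control oligo a read came from.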

NA12878 genomic DNA (gDNA)

NA12878 gDNA, purchased from the Coriell Institute for Medical Research, was prepared for XACTLY ligation in several ways. Mechanical shearing: NA12878 was sheared to an average length of 350 bp using a Bioruptor Pico (Diagenode) and manufacturer's instructions. Sheared DNA was then size selected from 200 to 600 bp using a Pippin Prep dye free 2% gel (Sage Sciences) following manufacturer's instructions. Restriction enzyme digest: 1 μg of NA12878 was digested in a 50 μl reaction using 10 units of MluCI (New England Biolabs) at 37°C for 1 h. Digested DNA was purified using 2× AMPure beads (Beckman Coulter) following manufacturer's instructions. After purification DNA was size-selected from 200 to 600 bp using a Pippin Prep dye free 2% gel (Sage Sciences) and manufacturer's instructions. Enzymatic shearing: DNase I: 1 μg of NA12878 was digested in a 50 μl reaction using 0.01 units of DNase I (New England Biolabs) at 37°C for 10 min and stopped with 0.1 mM EDTA; DNA was purified as above. Micrococcal nuclease: 1 μg of NA12878 was digested in a 50 μl reaction using 2 units of Micrococcal nuclease (New England Biolabs) at 37°C for 5 min and stopped with 0.1 mM EDTA; DNA was purified as above.

Human plasma extraction and cell-free DNA preparation

Whole blood from deidentified donors was obtained for in vitro investigational use from the Stanford Blood Center in Palo Alto, CA. Blood was drawn into one of several tube types ( Supplementary Table S2 ). Blood plasma was extracted from whole blood by spinning the blood collection tubes at 1800 g for 10 min at 4°C. Without disturbing the cell layer, the supernatant was transferred to microcentrifuge tubes under sterile conditions in 2 ml aliquots and spun again at 16 000 g for 10 min at 4°C to remove cell debris and stored at −80°C as 1 ml aliquots. cfDNA was extracted from 1 ml plasma using the QiaAmp ccfDNA kit (Qiagen) following manufacturer's protocol. Purified cfDNA was measured for double-stranded DNA (dsDNA) concentration using the Quant-iT high sensitivity dsDNA Assay Kit and a Qubit Fluorometer (ThermoFisher). Purified cfDNA was analyzed for size distribution using the Agilent TapeStation 4200 and associated D1000 and D5000 high sensitivity products. Cell-free DNA was prepared for XACTLY ligation by dephosphorylation followed by 5′ phosphorylation using the protocol detailed below (Preparing DNA termini for adapter ligation).

Control oligo-blood spike experiments

Approximately 40 ml of whole blood was collected in four blood collection tubes (10 ml each) from a single donor ( Supplementary Table S2 ). Blood from each collection tube was divided into three equal aliquots. To evaluate the effect of blood nucleases on DNA termini, a pool of our control oligos (1 pmol total per ml of whole blood) was added to aliquoted blood collection tubes under sterile conditions. In the case of serum tubes, because coagulation initiates from the time of blood draw, the clot was separated at the start of the experiment and the control oligo pool was added to 1 ml of the supernatant prior to serum preparation. Water and 1X PBS pH7.4 were used as negative controls, substituting for whole blood. The blood product-oligo mixtures (and negative controls) were incubated for 0, 4 or 24 h. Immediately following each time point, blood plasma extraction was performed as above. cfDNA extractions were performed from each spiked plasma sample using the Qiagen QiaAmp ccfDNA kit. The bead binding buffer, proteinase K and magnetic bead volumes were scaled according to the input plasma volume. DNA termini preparation of control-spiked cfDNA was performed as described below, followed by XACTLY adapter ligation, nick repair and amplification.

Ethics statement

This work is not considered human subjects research under the HHS human subjects regulations (45 CFR Part 46).

XACTLY library preparation

Preparing XACTLY sequencing adapters

Each XACTLY adapter contains Illumina sequencer-specific priming sites and a Unique-End-Identifier (UEI), a barcode sequence that indicates the length and identity (5′ or 3′) of the overhang, if any, present in the original molecule (Supplementary Table S3). The XACTLY adapters were synthesized using standard desalting purification and duplexed by Integrated DNA Technologies (IDT). For purposes of this study the 13 XACTLY adapters include six with 3′ overhangs (1–6 nt in length), six with 5′ overhangs (1–6 nt in length), and a single blunt adapter (i.e. no overhang). XACTLY adapters were not phosphorylated and thus are discouraged from forming dimers. All 13 duplexed XACTLY adapters were pooled in equimolar ratio and prepared for ligation by terminal dephosphorylation using the following 20 μl reaction: 1 pmol of pooled XACTLY adapters, 10 units of rapid Shrimp Alkaline Phosphatase (New England Biolabs), 1× CutSmart Buffer, incubated at 37°C for 30 min followed by a 10-min heat inactivation at 65°C. Multiple dephosphorylation reactions were combined over a single QIAquick Nucleotide Removal column (Qiagen) and purified according to manufacturer's instructions. XACTLY adapter molarity was calculated using DNA concentration (Qubit Fluorometric Quantitation) and double-stranded base pair length. Purified XACTLY adapters could be used directly and/or stored at −20°C.
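The conversion from a measured dsDNA concentration and fragment length to molar amount can be sketched as below. It assumes the common approximation of 660 g/mol per base pair, which the text does not specify, and the adapter length in the example is an illustrative value only.

```python
AVG_BP_MASS_G_PER_MOL = 660.0  # assumed average mass of one dsDNA base pair


def dsdna_pmol(mass_ng: float, length_bp: float) -> float:
    """Picomoles of double-stranded DNA molecules in a given mass."""
    # ng -> g is 1e-9, mol -> pmol is 1e12, so the net factor is 1e3
    return mass_ng * 1e3 / (AVG_BP_MASS_G_PER_MOL * length_bp)


def dsdna_nanomolar(conc_ng_per_ul: float, length_bp: float) -> float:
    """Molar concentration (nM) from a fluorometric reading in ng/ul."""
    # ng/ul equals 1e-3 g/l, and mol/l -> nM is 1e9, so the net factor is 1e6
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS_G_PER_MOL * length_bp)


if __name__ == "__main__":
    # Illustrative: a 50 bp duplex measured at 0.66 ng/ul
    print(dsdna_nanomolar(0.66, 50))
    print(dsdna_pmol(33.0, 50))
```

The same arithmetic applies to the later library pooling step, with the mean fragment length from the size distribution used in place of a single adapter length.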

Preparing template DNA termini for adapter ligation

The termini of template DNA molecules, including control oligos, were prepared for adapter ligation. DNA ends (up to 1 pmol) were dephosphorylated in a 20 μl reaction using rapid Shrimp Alkaline Phosphatase (rSAP) (New England Biolabs) and 1× CutSmart buffer incubated at 37°C for 30 min followed by a 10-min heat inactivation at 65°C. DNA was then 5′ phosphorylated by bringing the heat-inactivated 20 μl rSAP reaction up to 40 μl using 20 units of T4 Polynucleotide Kinase (New England Biolabs), and a 5% final concentration of PEG 8000. The phosphorylation reaction was carried out at 37°C for 30 min followed by a 30-min heat inactivation step at 65°C.

Adapter ligation and nick repair

XACTLY ligation consisted of an initial ligation step and a subsequent nick repair ligation step prior to standard NGS library amplification and indexing. First, 0.05 pmol of substrate DNA (control/NA12878/cfDNA) was combined with 1 pmol of XACTLY adapters in a 60 μl ligation reaction with 800 units of T4 DNA ligase (New England Biolabs) and 1× T4 DNA Ligase Buffer, and incubated at 20°C for 1 h, followed by either a 2× AMPure clean for control oligos, or a 1.2× AMPure clean for NA12878 or cfDNA. After DNA purification, DNA was again phosphorylated with 20 units of T4 Polynucleotide Kinase (New England Biolabs) and 1× T4 DNA ligase buffer in a 48.8 μl reaction and incubated at 37°C. After 30 min, 480 units of T4 DNA ligase was added to the reaction and the temperature reduced to 20°C for 15 min. Nick repair was followed by a 2× AMPure bead clean and elution in 20 μl of low TE (10 mM Tris pH 8, 0.1 mM EDTA).

Library amplification and indexing for Illumina sequencing

For Index PCR, 10 μl of purified XACTLY ligated DNA was combined with 1× Kapa HiFi HotStart ReadyMix (Kapa Biosystems) and 0.4 mM final concentrations of the Illumina-compatible IS4 primer and a single indexing primer, as described in ( 7), in a 50 μl reaction and amplified using the following thermal cycling conditions: 3 min at 98°C for initial denaturation followed by 15 cycles for control/NA12878 or 18 cycles for cfDNA at 98°C for 20 s, 68°C for 30 s, 72°C for 30 s, and finally an elongation step of 1 min at 72°C. After index PCR, DNA was purified with either a 1.5× AMPure clean for control oligos, or a 1.2× AMPure clean for NA12878/cfDNA. For each sequencing DNA library, final molarity estimates were calculated using fragment length distribution and dsDNA concentration (Agilent Tapestation 4200 and Qubit Fluorometric Quantitation unit). Samples were then pooled and run 2 × 150 bp cycles on an Illumina MiSeq benchtop sequencer (following manufacturer's instructions) to a depth of ∼100 000 read-pairs per sample. Step-by-step instructions for XACTLY library preparation method are detailed in the Supplemental Protocol.

Informatic analysis

Read processing

Mapping UEI-barcoded read pairs poses a bioinformatic challenge when template molecules, plus the 7-nt UEIs, are shorter than the sum of the lengths of the forward and reverse reads. The challenge of mapping short fragments exists because each read can extend through its mate's UEI sequence and possibly beyond into the Illumina adapter sequence. Standard practice in studies where short template molecules are expected, such as in the field of ancient DNA, is simultaneously to remove adapter sequences and merge reads (20). Specifically, the process collapses forward and reverse reads into single sequences based on sequence similarity and a minimum amount of overlap while trimming ends of reads that match known Illumina adapter sequences (see SeqPrep). When UEIs are present, however, a merged read that is shorter than or equal to the read length will have a 7-nt UEI on both ends, one of which will be reverse-complemented. The reverse-complemented UEI from R2 has the potential to interfere with read mapping. For this reason, we truncated each forward and reverse read wherever its mate's reverse-complemented UEI sequence was found.

For each read, we first checked for the presence of a known UEI at the start of the forward and reverse read in each pair. UEIs were allowed to contain up to one ‘N’ base, but no other base mismatches were allowed. If both reads had a known UEI sequence, we then checked whether reads merged by searching each sequence for the reverse complement of its mate's UEI. If neither read met this criterion, both reads were output unchanged, since a read can only include adapter sequence if it extends through its mate's UEI sequence. If both reads contained their mate's reverse-complemented UEI sequence, and the positions at which the mates’ UEIs were encountered matched, then both reads were truncated at that position. If the positions did not match, indicating an artifact such as a chimera, both reads were discarded. Across all control oligo experiments, an average of 3.3% of reads per library were discarded this way, compared to 4.3% discarded for lacking a known UEI sequence.
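The decision rules in this paragraph can be summarized in a short sketch. It is an illustration and not the authors' implementation. The UEI sequences in the usage example are invented, and two cases that the text leaves open are handled by assumption: a prefix containing 'N' that fits more than one UEI is treated as unknown, and a pair in which only one read contains its mate's reverse-complemented UEI is left unchanged.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def match_uei(read: str, known_ueis: list[str], uei_len: int = 7):
    """Return the known UEI at the start of the read, allowing at most one
    'N' and no other mismatch. Ambiguous matches return None (assumption)."""
    prefix = read[:uei_len]
    if len(prefix) < uei_len or prefix.count("N") > 1:
        return None
    hits = [
        uei for uei in known_ueis
        if len(uei) == uei_len
        and all(p == u or p == "N" for p, u in zip(prefix, uei))
    ]
    return hits[0] if len(hits) == 1 else None


def process_pair(r1: str, r2: str, known_ueis: list[str]):
    """Apply the truncate / keep / discard rules to one read pair.
    Returns (status, read1, read2); reads are None when the pair is dropped."""
    uei1, uei2 = match_uei(r1, known_ueis), match_uei(r2, known_ueis)
    if uei1 is None or uei2 is None:
        return ("no_known_uei", None, None)
    # Search each read, past its own UEI, for the mate's reverse-complemented UEI
    pos1 = r1.find(revcomp(uei2), len(uei1))
    pos2 = r2.find(revcomp(uei1), len(uei2))
    if pos1 != -1 and pos2 != -1:
        if pos1 == pos2:
            return ("truncated", r1[:pos1], r2[:pos2])
        return ("discarded", None, None)  # positions disagree, e.g. a chimera
    # Neither read (or, by assumption, only one) runs into the mate's UEI
    return ("unchanged", r1, r2)


if __name__ == "__main__":
    ueis = ["ACGTACG", "TTGCAAC"]  # invented 7-nt UEIs
    insert = "GGGCCCAAATTT"
    fwd = ueis[0] + insert + revcomp(ueis[1]) + "AGATCGG"
    rev = ueis[1] + revcomp(insert) + revcomp(ueis[0]) + "AGATCGG"
    print(process_pair(fwd, rev, ueis))
```

For a short template, both reads reach the mate's UEI at the same offset (the UEI length plus the insert length), which is why matching positions indicate a well-formed molecule.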

Rather than storing all merged read pairs as collapsed sequences, we kept them as truncated read pairs, so that UEI sequences of mates would not interfere with mapping to reference genomes. For the sake of our control oligo experiments, in which relatively short known sequences were expected, we also stored collapsed sequences for read pairs that merged using our criteria. For such sequences, we allowed the bases within the merged region to contain at most one mismatch (the chosen base at mismatching positions was the base with the higher quality, or a random base in the case of a tie).

To reduce the risk of contamination of our sequencing data by the Illumina sequencing control DNA—phiX—due to index misassignment, we first aligned all of our raw data to the phiX genome using bwa mem ( 21) with default parameters. Across all experiments, reads aligning to the phiX genome comprised on average 0.28% of the data. We extracted reads that did not map to phiX (samtools) ( 22) and used these for downstream analyses.

Limiting to reverse reads

Because we found that overhanging adapters were less reliable when encountered on forward (P5) rather than reverse (P7) reads (see Results section on Accuracy), our analyses ignored forward reads that began with an overhanging adapter. Blunt adapters were allowed on both the forward and reverse reads. In all cases, this filtering step was applied only when computing the results of experiments, that is, all reads were included when processing, merging, and aligning, but overhanging adapters on forward reads were not allowed to affect results.

Accuracy, precision and recall measurements in control oligo experiments

When processing control oligos we expected all properly formed sequences to merge using our criteria (see Read processing above), except in cases where control oligos chained together. We defined three ways of assessing control oligo experiments: accuracy, precision and recall.

To measure accuracy, we evaluated how reliably each UEI ligates to its correct target. We generated 17 replicate libraries using an equimolar pool of control oligos containing overhangs corresponding to the overhang types and lengths available in the UEI adapter pool. Per-library accuracy is measured as the proportion of correct ligation events within that library, considering only UEIs from reverse reads. The overall accuracy is averaged over 17 libraries.
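The accuracy calculation is a mean of per-library proportions, not a proportion over pooled counts, so each library carries equal weight regardless of its depth. A minimal sketch with invented counts:

```python
def library_accuracy(correct_events: int, total_events: int) -> float:
    """Proportion of correct ligation events within one library."""
    if total_events <= 0:
        raise ValueError("library has no ligation events")
    return correct_events / total_events


def overall_accuracy(libraries: list[tuple[int, int]]) -> float:
    """Mean of the per-library accuracies; each item is (correct, total)."""
    per_library = [library_accuracy(c, t) for c, t in libraries]
    return sum(per_library) / len(per_library)


if __name__ == "__main__":
    # Invented counts for two libraries of different depth
    print(overall_accuracy([(90, 100), (40, 50)]))
```

With the invented counts above, pooling the events would give 130/150, whereas averaging the two per-library values gives the mean of 0.9 and 0.8.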

To measure precision, we computed the proportion of UEI sequences that were ligated to the correct end of the control oligo with the matching overhang. In this case, we did not exclude the ends of control oligos that formed chains, thereby assessing all DNA ends available for UEI adapter ligation. For every paired-end read (truncated as described in Read Processing above, but not merged), we aligned the sequence following the UEI to a reference sequence containing all control oligo sequences, separated by runs of ‘N’ bases equal to the length of the longest overhang. The best alignment, allowing up to one mismatch and with ‘N’ matching any base, was used to determine the correct control oligo sequence. Only non-chimeric alignments, i.e. within the coordinates of a single control oligo sequence, were considered. Precision was then defined as the proportion of reads for which the UEI at the beginning of the read was followed by the correct type of control oligo end, in the correct orientation.

To measure recall, we computed the proportion of control oligo ends that were correctly identified using our adapters. First, all reads that merged using our criteria (see Read Processing above) were considered. Next, we constructed a reference sequence consisting of all control oligo sequences and their reverse complements, separated by runs of ‘N’ bases equal in length to the longest control oligo overhang. To determine the control oligo type of each merged read, we aligned merged reads to this reference sequence using the Edlib C++ sequence alignment library (23), allowing gaps at the beginning and end of the read in the alignment and allowing up to one base mismatch, letting ‘N’ match any base with no penalty. If the best alignment fell within the coordinates of a single control oligo sequence (a non-chimeric alignment), that control oligo was chosen as the correct sequence. A control oligo was considered correct if the barcode for the correct overhang was ligated to the expected overhang end of the oligo and the barcode for blunt adapters was ligated to the opposite end.
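
The reference construction and the non-chimeric check can be illustrated with a short sketch. It is a simplification under stated assumptions: it uses an ungapped sliding-window comparison in pure Python rather than Edlib, the oligo sequences and names are invented, and ties are broken by taking the leftmost best hit.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (N stays N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def build_reference(oligos, spacer_len):
    """Join each oligo and its reverse complement, separated by runs of N.

    Returns the reference string and a list of (name, start, end) intervals.
    """
    parts, intervals, pos = [], [], 0
    for name, seq in oligos.items():
        for label, s in ((name, seq), (name + "_rc", revcomp(seq))):
            if parts:  # spacer before every sequence except the first
                parts.append("N" * spacer_len)
                pos += spacer_len
            parts.append(s)
            intervals.append((label, pos, pos + len(s)))
            pos += len(s)
    return "".join(parts), intervals

def mismatches(read, window):
    """Count mismatches, letting N match any base at no cost."""
    return sum(1 for a, b in zip(read, window)
               if a != b and a != "N" and b != "N")

def classify(read, reference, intervals, max_mismatch=1):
    """Return the oligo whose interval contains the best hit, else None."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        d = mismatches(read, reference[start:start + len(read)])
        if d <= max_mismatch and (best is None or d < best[0]):
            best = (d, start)
    if best is None:
        return None  # no alignment within the mismatch limit
    start, end = best[1], best[1] + len(read)
    for name, lo, hi in intervals:
        if lo <= start and end <= hi:
            return name
    return None  # hit spans a spacer: treated as chimeric

# Invented oligos; the spacer stands in for "longest overhang length".
OLIGOS = {"oligoA": "ACGTTGCAAGGC", "oligoB": "TTGACCGTAACG"}
reference, intervals = build_reference(OLIGOS, spacer_len=3)
```

Because the spacers consist of N, a read that straddles two oligos can still score well, which is why the interval check is needed to reject it.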

Nucleotide composition of overhang sequence

When assessing the base composition of overhang sequences, we required that all adapters be ligated to the correct type of control oligo. We considered all bases between the end of a UEI sequence and the beginning of a control oligo sequence to be the true sequence of the overhang.

Human DNA data processing

Paired-end reads that remained after filtering were truncated if necessary (see Read Processing above) and aligned to the hg19 human reference genome downloaded from the UCSC genome browser (24). We used bwa aln and bwa sampe (25) with default parameters for alignment, skipping the UEI sequences at the beginning of the reads (-B parameter). Duplicate reads were then removed using samtools rmdup. We counted as mapped only reads that were in proper pairs with a minimum map quality of 20 (samtools view -c -f66 -q20), except in the case of the restriction enzyme experiments, in which we removed the requirement for proper pairing (samtools view -c -f64 -q20) because chained fragments could cause chimeric alignments.
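
The two flag values used with samtools view can be unpacked with a small sketch. In the SAM specification, 0x2 marks a properly paired read and 0x40 marks the first read of a pair, so 66 requires both bits while 64 requires only the second. The record tuples and the function name below are illustrative assumptions; the logic mirrors what -f and -q select.

```python
PROPER_PAIR = 0x2     # each segment properly aligned according to the aligner
FIRST_IN_PAIR = 0x40  # first segment in the template

STRICT = PROPER_PAIR | FIRST_IN_PAIR  # 66, as in -f66
RELAXED = FIRST_IN_PAIR               # 64, as in -f64

def passes(flag, mapq, required_flags, min_mapq=20):
    """True if all required flag bits are set and MAPQ meets the minimum."""
    return (flag & required_flags) == required_flags and mapq >= min_mapq

# Toy (flag, mapq) records.
records = [
    (99, 60),   # paired, proper pair, first in pair
    (65, 30),   # paired, first in pair, not a proper pair
    (147, 60),  # proper pair but second in pair
    (99, 10),   # proper pair, first in pair, low MAPQ
]
strict_count = sum(passes(f, q, STRICT) for f, q in records)
relaxed_count = sum(passes(f, q, RELAXED) for f, q in records)
```

Requiring the first-in-pair bit means each read pair is counted once rather than twice.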

To count UEI types in mapped reads, we scanned through the BAM files using HTSLib's BAM parser (22) and obtained UEI sequences from the BC tag. Overhang sequences were obtained by taking a number of bases from the beginning of each read equal to the overhang length indicated by the UEI.
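
As a rough sketch of this step, the lookup below maps a UEI to an overhang type and length and then slices that many bases from the start of the read. The barcode sequences, type labels, and table layout are hypothetical placeholders, not the published UEI set, and the reads are plain tuples rather than records parsed from a BAM file.

```python
from collections import Counter

# Hypothetical UEI table: barcode -> (overhang type, overhang length).
UEI_TABLE = {
    "ACGTAC": ("5prime", 2),
    "TGCATG": ("3prime", 3),
    "GGATCC": ("blunt", 0),
}

def overhang_from_read(read_seq, uei):
    """Return (overhang type, overhang bases) for one read."""
    kind, length = UEI_TABLE[uei]
    return kind, read_seq[:length]

# Toy (BC tag value, read sequence) pairs standing in for mapped reads.
reads = [
    ("ACGTAC", "TTAGGCA"),
    ("TGCATG", "CCGATTA"),
    ("GGATCC", "AAGTCCA"),
    ("ACGTAC", "GAACCTT"),
]
type_counts = Counter(UEI_TABLE[uei][0] for uei, _ in reads)
overhangs = [overhang_from_read(seq, uei) for uei, seq in reads]
```

A blunt UEI has length 0 in this table, so the slice returns an empty overhang.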

Downsampling data

To evaluate whether DNA termini profiles were affected by sequencing depth, we re-sequenced a library replicate of Bioruptor sonicated NA12878 gDNA (see DNA templates for XACTLY experiments above) on an Illumina NextSeq. From 6,757,000 read pairs, we down-sampled the raw sequencing data using seqtk sample to increasingly shallow sequencing depths: 5 million, 2 million, 1 million, 0.5 million and 0.1 million read pairs. We also included read data that were generated by sequencing the same library on an Illumina MiSeq to a depth of ∼0.2 million reads. We then processed each of the down-sampled sequence files, along with the original NextSeq and MiSeq files, as described above in Human DNA data processing.
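
When paired-end data are down-sampled, the same reads must be kept from both mate files so that pairs stay together. The sketch below illustrates that idea with a seeded random draw over in-memory lists. It is not seqtk's implementation, and the function name, seed value, and toy read names are assumptions.

```python
import random

def downsample_pairs(r1, r2, n, seed=11):
    """Keep the same n randomly chosen positions from both mate lists."""
    if len(r1) != len(r2):
        raise ValueError("mate lists must have the same length")
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    keep = sorted(rng.sample(range(len(r1)), min(n, len(r1))))
    return [r1[i] for i in keep], [r2[i] for i in keep]

# Toy read names standing in for FASTQ records.
mates1 = [f"read{i}/1" for i in range(100)]
mates2 = [f"read{i}/2" for i in range(100)]
sub1, sub2 = downsample_pairs(mates1, mates2, n=10)
```

Running the function again with the same seed returns the same subset, which keeps a series of shallower depths reproducible.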

Control oligo spike-in experiments

Some sequencing libraries consisted of human DNA spiked with control oligos. To analyze these libraries, we first processed all sequencing reads as if the libraries contained only human DNA (see Human DNA data processing). Then, non-human sequences were extracted from the alignments to the human reference genome, by selecting unmapped reads and reads with map quality less than 10 (using a custom technique that can re-append barcodes to extracted read sequences). These reads, which consisted mostly of control oligos, were then processed the same way as other control oligo libraries.
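
The selection rule for candidate non-human reads can be sketched as below. The unmapped bit (0x4) comes from the SAM specification. Everything else is an assumption made for illustration: the record layout, the function names, and the choice to re-attach the barcode at the start of the read, which follows from the UEI sitting at the beginning of each read before alignment.

```python
UNMAPPED = 0x4  # SAM flag bit: segment unmapped

def is_non_human_candidate(flag, mapq, mapq_cutoff=10):
    """True for unmapped reads and for reads with MAPQ below the cutoff."""
    return bool(flag & UNMAPPED) or mapq < mapq_cutoff

def reattach_barcode(barcode, seq):
    """Prepend the barcode (from the BC tag) to the read sequence."""
    return barcode + seq

# Toy (flag, mapq, BC tag, sequence) records.
alignments = [
    (99, 60, "ACGTAC", "TTAGGCA"),  # confidently mapped: kept as human
    (77, 0, "TGCATG", "CCGATTA"),   # unmapped (flag 77 includes 0x4)
    (99, 3, "GGATCC", "AAGTCCA"),   # mapped, but MAPQ below 10
]
extracted = [
    reattach_barcode(bc, seq)
    for flag, mapq, bc, seq in alignments
    if is_non_human_candidate(flag, mapq)
]
```

The extracted sequences again start with their UEI, so they can go through the same processing as reads from control oligo libraries.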

Dig Deeper

Dig Deeper 1: The chromosome theory of inheritance

Chromosomes came to the forefront in the late 19th and early 20th centuries as cell biologists began to study their structures and behaviors by microscopy. The term chromosome was coined in reference to the subcellular "bodies" (soma in ancient Greek means "body") that were brightly stained with certain chemical dyes (chroma in ancient Greek means "color"). The chromosome theory of inheritance emerged after the rediscovery in 1900 of Mendel’s long-forgotten 1865 paper on inheritance (see Narrative on Inheritance by Tilghman). Theodor Boveri, studying roundworms, noted that chromosome numbers were reduced by half (haploid) during the divisions of the germ cells to produce gametes. Chromosome copy number was then restored to the normal level (diploid) during fertilization, which was consistent with chromosomes carrying copies of heritable material from the mother and father. Sutton, studying grasshoppers, described the pairing of maternal and paternal chromosomes and their subsequent separation during the cell divisions that produce gametes, a process that we now refer to as meiosis (see Video DD1). Sutton clearly articulated the theory by writing, "I may finally call attention to the probability that the association of paternal and maternal chromosomes in pairs and their subsequent separation during the reducing division as indicated above may constitute the physical basis of the Mendelian law of heredity."

Video DD1 Sketch video of the process of meiosis.

Dig Deeper 2: The Bell and Astbury hypothesis that DNA creates a scaffold for proteins, which jointly create a structure for heredity

Bell and Astbury were struck by the close match between the spacing between the bases in DNA (3.34 Å) and the spacing between amino acids in a stretched out polypeptide (3.4 Å). They thought that this very similar spacing must reflect some important connection between the two. Bell said in her thesis, "The most striking attribute of the nucleic acid column is the periodicity of the nucleotides, 3.34 Å, which is equal to the side-chain period of a fully extended polypeptide chain. It is difficult to believe that the agreement is no more than coincidence rather it is stimulating thought that probably the interplay of proteins and nucleic acids in the chromosome is largely based on this very circumstance." Bell and Astbury went on to speculate that DNA in chromosomes serves as a scaffold to align amino acids from proteins, and that the aligned amino acids might provide the information for heredity. This hypothesis is incorrect, but one must remember that their ideas came at a time when there were many misconceptions about DNA, such as the tetranucleotide hypothesis and the widely accepted belief that genes were proteins.

However, Bell’s general idea that proteins and nucleic acids together form (in her words) "the long scroll of life on which is written the pattern of life" ended up proving prophetic. We now know that many proteins interact with DNA (e.g., transcription factors) and that these interactions allow the DNA code to be read out at the right time and in the right amount. In reality, proteins and DNA together (not DNA alone) determine "the pattern of life."

Dig Deeper 3: Franklin’s Photograph 51

Here, we briefly describe how the pattern of spots in Franklin’s X-ray photograph can be interpreted to provide information on the structure of DNA. Detailed information on the theory of X-ray diffraction is beyond the scope of this Narrative; however, more information can be found in the Resource section.

The spots in the photograph are made by X-rays that interact with atoms in the DNA and are scattered, meaning that they diverge from the straight trajectory that the X-ray beam took when it entered the DNA fiber. A spot emerges if many X-rays arrive at a particular location and do so in "phase," which means that the peaks and troughs of the X-ray waves are aligned (constructive interference). If they are misaligned, a spot will not be produced. If atoms are ordered at regular intervals, then many of the scattered X-ray waves hit the photographic film in phase with one another and expose the photographic emulsion. Thus, the spots (or lines) in the photograph provide information on regular, repeating patterns of atoms in the sample. The information, however, is in a curious "reciprocal space," which means that a smaller regular spacing of atoms in the sample produces spots farther from the center.

In photograph 51, one can see 10 equally spaced "lines" from the axis, which are called "layer lines" (Figure DD3). The closest layer line to the center reflects the largest regular repeating pattern of atoms in the DNA. This represents the distance of the regular helical repeat, which can be calculated from its location in the photograph to be 34 Å. The farthest layer line from the center (the arc at the top) reflects the smallest regular spacing in DNA. This is the spacing between bases, which is 3.4 Å (also noticed by Florence Bell in Clue 2 in the Journey to Discovery). The 10 regularly spaced layer lines are produced by the 10 bases that make up one repeat of the helix. The layer lines also form a distinctive X-shape, signifying a helix. This is due to the zig-zag pattern of the phosphate backbone, as it travels to the left and then to the right across the axis of the double helix. The angle of the X in the photograph reveals the pitch angle of the helix (Figure DD3). The absence of a fourth layer line is a consequence of the slight offset of the two phosphate backbones (also providing evidence of a double helix).

Figure DD3 Information on the DNA double helix provided by the pattern of lines in photograph 51 obtained by Rosalind Franklin.

Dig Deeper 4: Different forms of DNA

DNA can adopt different double-stranded structures depending upon the conditions, such as the salt concentration and water content. The B-form is found in living cells and is the form primarily discussed in this Narrative. The A-form of DNA is observed under dehydrating conditions (less water). Compared to B-DNA, it is wider, the repeat distance of the helix is shorter, and the bases are slightly tilted. Another helical form is Z-DNA, which forms in a test tube under high salt and with certain base sequences (alternating C-G and G-C pairs). Z-DNA is a left-handed helix (unlike the B- and A-forms) and has a narrower diameter and longer helix repeat than the B-form. Z-DNA may form under some limited circumstances in biological systems.

Figure DD4 Three different forms of double-stranded DNA: the B-form, A-form, and Z-form. The B-form is the one found commonly in living organisms. Figure derived from PDB 101(