# 5.3: Genome Assembly II: String graph methods

Shotgun sequencing, a more modern and economical method of sequencing, gives reads that are around 100 bases in length. Hence, we need new and more sophisticated algorithms to do genome assembly correctly.

## String graph definition and construction

The idea behind string graph assembly is similar to the graph of reads we saw in section 5.2.2. In short, we construct a graph in which the nodes are sequence data and the edges are overlaps, and then try to find the most robust path through all the edges to represent our underlying sequence.
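As a minimal sketch of this construction, the snippet below builds a graph whose nodes are reads and whose edges are suffix-prefix overlaps. The function names, the `min_len` threshold, and the toy reads are illustrative assumptions; exact matching is used for simplicity, and reverse-complement overlaps are ignored.

```python
from itertools import permutations

def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def build_overlap_graph(reads, min_len=3):
    """Return {(a, b): overlap_length} for every ordered pair of reads that overlap."""
    graph = {}
    for a, b in permutations(reads, 2):
        olen = overlap_length(a, b, min_len)
        if olen:
            graph[(a, b)] = olen
    return graph

# Hypothetical toy reads, chosen only to illustrate the idea
reads = ["ATGGCG", "GGCGTA", "CGTACC"]
graph = build_overlap_graph(reads)
```

Comparing all pairs is quadratic in the number of reads; real assemblers use indexing to avoid this, and they tolerate mismatches when detecting overlaps.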

Figure 5.11: Constructing a string graph

There are a couple of subtleties in the string graph (figure 5.11) which need mentioning:

• We have two different colors for nodes since the DNA can be read in two directions. If the overlap is between the reads as is, then the nodes receive the same color. If the overlap is between a read and the reverse complement of the other read, then they receive different colors.
• If A and B overlap, then there is ambiguity in whether we draw an edge from A to B, or from B to A. Such ambiguity needs to be resolved in a consistent manner at junctions caused by repeats.

After constructing the string graph from overlapping reads, we:

• Remove transitive edges: Transitive edges are caused by transitive overlaps, i.e. A overlaps B and B overlaps C in such a way that A also overlaps C. There are randomized algorithms which remove transitive edges in O(E) expected runtime. Figure 5.12 shows an example of removing transitive edges.

• Collapse chains: After removing the transitive edges, the graph we build will have many chains where each node has one incoming edge and one outgoing edge. We collapse all these chains to a single edge. An example of this is shown in figure 5.13.
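The two simplification steps above can be sketched as follows. This is a naive illustration, not the O(E) expected-time algorithm mentioned in the text: it only removes an edge A→C when a single intermediate node B exists, and it assumes the graph is given as a set of directed `(source, target)` pairs. All names are hypothetical.

```python
def remove_transitive_edges(edges):
    """Drop every edge (a, c) for which some b has both (a, b) and (b, c)."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    reduced = set()
    for a, c in edges:
        implied = any(c in succ.get(b, ()) for b in succ[a] if b not in (a, c))
        if not implied:
            reduced.add((a, c))
    return reduced

def collapse_chains(edges):
    """Merge maximal unbranched paths; each result is the tuple of nodes on one path."""
    succ, pred = {}, {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
        pred.setdefault(b, []).append(a)

    def is_inner(node):
        # Exactly one incoming and one outgoing edge
        return len(pred.get(node, [])) == 1 and len(succ.get(node, [])) == 1

    collapsed = []
    for start in sorted(set(succ) | set(pred)):
        if is_inner(start):
            continue  # paths start only at junctions or end points
        for nxt in sorted(succ.get(start, [])):
            path = [start, nxt]
            while is_inner(path[-1]):
                path.append(succ[path[-1]][0])
            collapsed.append(tuple(path))
    return collapsed  # note: isolated cycles with no junction are not reported

edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E"), ("D", "F")}
reduced = remove_transitive_edges(edges)
chains = collapse_chains(reduced)
```

In this toy graph the edge A→C is implied by A→B→C and is removed, after which the chain A→B→C→D collapses into a single edge that ends at the junction D.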

## Flows and graph consistency

After doing everything mentioned above we will get a pretty complex graph, i.e. it will still have a number of junctions due to relatively long repeats in the genome compared to the length of the reads. We will now see how the concepts of flows can be used to deal with repeats.

First, we estimate the weight of each edge from the number of reads that correspond to the edge. If an edge has double the number of reads we would expect from the number of DNA copies we sequenced, then it is fair to assume that this region of the genome is repeated. However, this technique by itself is not accurate enough. Hence we sometimes only estimate that the weight of an edge is ≥ 2, without assigning a particular number to it.

We use reasoning from flows in order to resolve such ambiguities. We need to satisfy the flow constraint at every junction, i.e. the total weight of all the incoming edges must equal the total weight of all the outgoing edges. For example, in figure 5.14 there is a junction with an incoming edge of weight 1, and two outgoing edges of weight ≥ 0 and ≥ 1. Since the outgoing weights must sum to 1, we can infer that they are exactly 0 and 1 respectively. Many weights can be inferred this way by iteratively applying the same process throughout the entire graph.
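The inference described above can be sketched as interval tightening: each edge carries a lower and an upper bound, and the flow constraint at every junction is applied repeatedly until nothing changes. The data layout, the function name, and the `max_rounds` guard are assumptions made for illustration.

```python
import math

def tighten_bounds(junctions, bounds, max_rounds=1000):
    """
    junctions: {name: (incoming_edge_ids, outgoing_edge_ids)}
    bounds:    {edge_id: [low, high]}, with high = math.inf when only "≥ low" is known
    Applies "flow in = flow out" at each junction until no bound changes.
    """
    for _ in range(max_rounds):
        changed = False
        for incoming, outgoing in junctions.values():
            for side, other in ((incoming, outgoing), (outgoing, incoming)):
                other_lo = sum(bounds[e][0] for e in other)
                other_hi = sum(bounds[e][1] for e in other)
                for e in side:
                    rest_lo = sum(bounds[x][0] for x in side if x != e)
                    rest_hi = sum(bounds[x][1] for x in side if x != e)
                    new_lo = max(bounds[e][0], other_lo - rest_hi)
                    new_hi = min(bounds[e][1], other_hi - rest_lo)
                    if [new_lo, new_hi] != bounds[e]:
                        bounds[e] = [new_lo, new_hi]
                        changed = True
        if not changed:
            break
    return bounds

# The junction from the text: one incoming edge of weight 1,
# two outgoing edges known only as ≥ 0 and ≥ 1.
junctions = {"J": (["in"], ["out_a", "out_b"])}
bounds = {"in": [1, 1], "out_a": [0, math.inf], "out_b": [1, math.inf]}
result = tighten_bounds(junctions, bounds)
```

After one pass the outgoing edges are pinned to exactly 0 and 1, matching the reasoning in the text.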

## Feasible flow

Once we have the graph and the edge weights, we run a min cost flow algorithm on the graph. Since larger genomes may not have a unique min cost flow, we iteratively do the following:

• Add ε penalty to all edges in solution
• Solve flow again - if there is an alternate min cost flow it will now have a smaller cost relative to the previous flow
• Repeat until we find no new edges

After doing the above, we will be able to label each edge as one of the following:

• Required: edges that were part of all the solutions
• Unreliable: edges that were part of some of the solutions
• Not required: edges that were not part of any solution
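The labelling step can be sketched as follows, assuming the iterative procedure has already produced a list of solutions, each represented as the set of edges it uses. The min cost flow solver itself is not shown, and the names and toy data are hypothetical.

```python
def classify_edges(all_edges, solutions):
    """Label each edge by how many of the collected flow solutions use it."""
    if not solutions:
        raise ValueError("at least one solution is needed")
    labels = {}
    for edge in all_edges:
        used = sum(1 for solution in solutions if edge in solution)
        if used == len(solutions):
            labels[edge] = "required"
        elif used > 0:
            labels[edge] = "unreliable"
        else:
            labels[edge] = "not required"
    return labels

# Hypothetical example: two alternative min cost flows over four edges
all_edges = ["e1", "e2", "e3", "e4"]
solutions = [{"e1", "e2"}, {"e1", "e3"}]
labels = classify_edges(all_edges, solutions)
```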

## Dealing with sequencing errors

There are various sources of errors in the genome sequencing procedure. Errors are generally of two different kinds, local and global.

Local errors include insertions, deletions and mutations. Such local errors are dealt with when we are looking for overlapping reads. That is, when checking whether reads overlap, we tolerate a small number of sequencing errors. Once we have computed overlaps, we can derive a consensus by mechanisms such as removing indels and mutations that are not supported by any other read and are contradicted by at least two other reads.
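A minimal sketch of such a consensus rule for substitutions is shown below. It assumes the overlapping reads have already been aligned to equal length, and it uses `N` as a placeholder when no base survives; both are simplifying assumptions, as are the function names. Indels could be handled the same way if gaps were represented by a `-` character.

```python
from collections import Counter

def consensus_column(bases):
    """Pick the consensus base at one aligned position, or None if nothing survives."""
    counts = Counter(bases)
    kept = {}
    for base, support in counts.items():
        contradicting = len(bases) - support
        # Drop a variant seen in only one read and contradicted by at least two reads
        if support == 1 and contradicting >= 2:
            continue
        kept[base] = support
    if not kept:
        return None
    return max(kept, key=kept.get)

def consensus(aligned_reads):
    """Column-wise consensus of equal-length aligned reads."""
    return "".join(consensus_column(list(column)) or "N" for column in zip(*aligned_reads))

# Hypothetical aligned reads; the third read has a single unsupported mismatch
result = consensus(["ACGT", "ACGT", "AGGT"])
```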

Global errors are caused by other mechanisms, such as two different sequences combining before being read, which gives a read that comes from different places in the genome. Such reads are called chimeras. These errors are resolved while looking for a feasible flow in the network. When the edge corresponding to a chimera is in use, the amount of flow going through this edge is small compared to the flow capacity. Hence, the edge can be detected and then ignored.

Each step of the algorithm is made as robust and resilient to sequencing errors as possible. The number of DNA copies split and sequenced is chosen so that we are able to reconstruct most of the DNA (i.e. to meet a quality target such as 98% or 95%).

## Resources

Some popular graph-based genome assemblers are listed below:

• Euler (Pevzner, 2001/06) : Indexing → de Bruijn graphs → picking paths → consensus
• Velvet (Birney, 2010) : Short reads → small genomes → simplification → error correction
• ALLPATHS (Gnerre, 2011) : Short reads → large genomes → jumping data → uncertainty

## Personalized genome structure via single gamete sequencing

Genetic maps have been fundamental to building our understanding of disease genetics and evolutionary processes. The gametes of an individual contain all of the information required to perform a de novo chromosome-scale assembly of an individual’s genome, which historically has been performed with populations and pedigrees. Here, we discuss how single-cell gamete sequencing offers the potential to merge the advantages of short-read sequencing with the ability to build personalized genetic maps and open up an entirely new space in personalized genetics.

## Background

The past decade has seen rapid progress of sequencing technologies [1]. The dramatic decrease of sequencing costs has enabled an ever-accelerating flood of genomic and transcriptomic data [2] that in turn have led to the development of a wide array of methods for data analysis. Despite recent efforts to study transcriptome evolution at large scales [3,4,5,6,7], the capability to analyze and integrate -omics data in large-scale phylogenetic comparisons lags far behind data generation. One key aspect of this shortcoming is the current lack of powerful tools for visualizing comparative -omics data. Available tools such as [8, 9] have been designed with closely related species or strains in mind. The visualizations become difficult to read for multiple species and larger evolutionary distances, where homologous genomic regions may differ substantially in their lengths, an issue that becomes more pressing the larger the regions of interest become. A common coordinate system for multiple genomes is not only a convenience for graphical representations of -omics data, however. It would also greatly facilitate the systematic analysis of all those genomic features that are not sufficiently local to be completely contained within individual blocks of a genome-wide multiple sequence alignment (gMSA).

Still, gMSAs are the natural starting point. Several pipelines to construct such alignments have been deployed over the past two decades, most prominently the tba / multiz pipeline [10, 11] employed by the UCSC genome browser and the Enredo / Pecan / Ortheus (EPO) pipeline [12] featured in the Ensembl system. For the ENCODE project data, alignments generated with MAVID [13] and M-LAGAN [13] have also become available; see [14] for a comparative assessment. A common feature of gMSAs is that they are composed of a large number of alignment blocks. At least in the case of MSAs of higher animals and plants, the individual blocks are typically (much) smaller than individual genes. As a consequence, they are not ready-to-use for detailed comparative studies, e.g. of transcriptome or epigenome [15] structure. In the gMSA-based splice site maps of [16], for example, it is easy to follow the evolution of individual splice junctions as they are localized within a block. At the same time it is difficult to collate the global differences of extended transcripts, which may span hundreds of blocks, and to relate changes in transcript structure with genomic rearrangements, insertions of repetitive elements or deletion of chunks of sequence.

To a certain extent this problem is alleviated by considering the blocks arranged w.r.t. a reference genome. For many applications, however, this does not appear to be sufficient. For sufficiently similar genomes with only few rearrangements, gMSA blocks are large or can at least be arranged so that large syntenic regions can be represented as a single aligned block. Any ordering of these large syntenic blocks, termed a supergenome in [17], then yields an informative common coordinate system. So far, this approach has been applied only to closely related prokaryotic genomes. Prime examples are a detailed comparative analysis of the transcriptome of multiple isolates of Campylobacter jejuni [18] or the reconstruction of the phylogeny of mosses from the “nucleotide pangenome” of mitogenomic sequences [19]. We remark that some approaches to “pangenomes” are concerned with gMSAs of (usually large numbers of) closely related isolates; most of this literature, however, treats pangenomes as sets of orthologous genes [20].

Here we are concerned with the coordinatization of supergenomes, i.e., the question how gMSA blocks can be ordered in a way that facilitates comparative studies of genome annotation data. In contrast to previous work on supergenomes we are in particular interested in large animal and plant genomes and in large phylogenetic ranges. We therefore assume that we have short alignment blocks and abundant genome rearrangement, leaving only short sequences of alignment blocks that are perfectly syntenic between all genomes involved. The problem of optimally sorting the MSA blocks can, as we shall see, be regarded as a quite particular variant of a vertex ordering problem, a class of combinatorial problems that recently has received increasing attention in computer science [21,22,23,24]. In the computational biology literature, furthermore, several graph-based methods have been proposed to solve the problem of sorting sequence blocks for supergenomes, see e.g. [12, 25,26,27,28,29].

This contribution is organized as follows: In the following section we first analyze the concept of the supergenome and its relationship to gMSAs in detail. We then review combinatorial optimization problems that are closely related to the “supergenome sorting problem”, and argue that the most appropriate modeling leads to a special type of betweenness ordering problem. Next, we introduce a heuristic solution that is geared towards very large input alignments and proceeds by step-wise simplification of the supergenome multigraph. Finally, we outline a few computational results.

## 1. Introduction

Most bacteria in environments ranging from the human body to the ocean cannot be cloned in the laboratory and thus cannot be sequenced using existing Next Generation Sequencing (NGS) technologies. This represents the key bottleneck for various projects ranging from the Human Microbiome Project (HMP) (Gill et al., 2006) to antibiotics discovery (Li and Vederas, 2009). For example, the key question in the HMP is how bacteria interact with each other. These interactions are often conducted by various peptides that are produced either for communication with other bacteria or for killing them. However, peptidomics studies of the human microbiome are now limited since mass spectrometry (the key technology for such studies) requires knowledge of fairly complete proteomes. On the other hand, while studies of new peptide antibiotics would greatly benefit from DNA sequencing of genes coding for Non-Ribosomal Peptide Synthetases (NRPS) (Sieber and Marahiel, 2005), existing metagenomics approaches are unable to sequence these exceptionally long genes (over 60,000 nucleotides).

HMP and discovery of new antibiotics are just two examples of many projects that would be revolutionized by Single-Cell Sequencing (SCS). Recent improvements in both experimental (Ishoey et al., 2008; Navin et al., 2011; Islam et al., 2011) and computational (Chitsaz et al., 2011) aspects of SCS have opened the possibility of sequencing bacterial genomes from single cells. In particular, Chitsaz et al. (2011) demonstrated that SCS can capture a large number of genes, sufficient for inferring the organism's metabolism. In many applications (including proteomics and antibiotics discovery), having a great majority of genes captured is almost as useful as having complete genomes.

Currently, Multiple Displacement Amplification (MDA) is the dominant technology for single-cell amplification (Dean et al., 2001). However, MDA introduces extreme amplification bias (orders-of-magnitude difference in coverage between different regions) and gives rise to chimeric reads and read-pairs that complicate the ensuing assembly. Acknowledging the fact that existing assemblers were not designed to handle these complications, Rodrigue et al. (2009) remarked that the challenges facing SCS are increasingly computational rather than experimental. Recent articles (Marcy et al., 2007; Woyke et al., 2010; Youssef et al., 2011; Blainey et al., 2011; Grindberg et al., 2011) illustrate the challenges of fragment assembly in SCS.

Chitsaz et al. (2011) introduced the E+V-SC assembler, combining a modified EULER-SR with a modified Velvet, and achieved a significant improvement in fragment assembly for SCS data. However, we (G.T. and P.A.P.), as coauthors of Chitsaz et al. (2011), realized that one needs to change algorithmic design (rather than just modify existing tools) to fully utilize the potential of SCS.

We present the SPAdes assembler, introducing a number of new algorithmic solutions and improving on state-of-the-art assemblers for both SCS and standard (multicell) bacterial datasets. Fragment assembly is often abstracted as the problem of reconstructing a string from the set of its k-mers. This abstraction naturally leads to the de Bruijn approach to assembly, the basis of many fragment assembly algorithms. However, a more advanced abstraction of NGS data considers the problem of reconstructing a string from a set of pairs of k-mers (called k-bimers below) at a distance ≈ d in a string. Unfortunately, while there is a simple algorithm for the former abstraction, analysis of the latter abstraction has mainly amounted to post-processing heuristics on de Bruijn graphs. While many heuristics for analysis of read-pairs (called bireads below) have been proposed (Pevzner et al., 2001; Pevzner and Tang, 2001; Zerbino and Birney, 2008; Butler et al., 2008; Simpson et al., 2009; Chaisson et al., 2009; Li et al., 2010), proper utilization of bireads remains, arguably, the most poorly explored stage of assembly. Medvedev et al. (2011a) recently introduced Paired de Bruijn graphs (PDBGs), a new approach better suited for the latter abstraction and having important advantages over standard de Bruijn graphs. However, PDBGs were introduced as a theoretical rather than practical idea, aimed mainly at the unrealistic case of a fixed distance between reads.
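To make the two abstractions concrete, the following minimal sketch (not SPAdes code) builds a de Bruijn graph from the k-mers of a string and lists its k-bimers. For simplicity it assumes an exact distance d measured between the start positions of the two k-mers, whereas the text allows an approximate distance; the function names and the toy sequence are illustrative assumptions.

```python
from collections import defaultdict

def de_bruijn_graph(sequence, k):
    """Nodes are (k-1)-mers; each k-mer contributes one edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

def k_bimers(sequence, k, d):
    """Pairs of k-mers whose start positions are exactly d apart (simplifying assumption)."""
    return [(sequence[i:i + k], sequence[i + d:i + d + k])
            for i in range(len(sequence) - d - k + 1)]

graph = de_bruijn_graph("ACGTAC", 3)
pairs = k_bimers("ACGTAC", 2, 3)
```

Here the four 3-mers of the toy string form a simple path through the 2-mer nodes, and each k-bimer records two 2-mers that occur three positions apart.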

We assume that the reader is familiar with the concept of A-Bruijn graphs introduced in Pevzner et al. (2004). De Bruijn graphs, PDBGs, and several other graphs in this paper are special cases of A-Bruijn graphs. SPAdes is a universal A-Bruijn assembler in the sense that it uses k-mers only for building the initial de Bruijn graph and “forgets” about them afterwards; on subsequent stages it only performs graph-theoretical operations on graphs that need not be labeled by k-mers. The operations are based on graph topology, coverage, and sequence lengths, but not the sequences themselves. At the last stage, the consensus DNA sequence is restored. We designed a universal assembler to implement several variations of A-Bruijn graphs (e.g., paired and multisized de Bruijn graphs) in the same framework, and to apply it to other applications where these graphs have proven to be useful (Bandeira et al., 2007, 2008; Pham and Pevzner, 2010).

In Section 2, we give an overview of the stages of SPAdes. In Section 3, we define de Bruijn graphs, multisized de Bruijn graphs, and PDBGs. Sections 4–6 cover different stages of SPAdes. In Section 7, we benchmark SPAdes and other assemblers on single-cell and cultured E. coli datasets. In Section 8, we give additional details about assembly graph construction and data structures. In Section 9, we give a detailed example of constructing a paired assembly graph. In Section 10, we further discuss the concept of a universal assembler.

## An assembly of reads, contigs and scaffolds

The Newbler assembler and mapper (gsAssembler, gsMapper) was developed especially for working with the reads from the Roche/454 Life Science sequencing technology. It is one of the best programs to deal with this type of data, scoring well in the assemblathon 2 competition. Newbler has been used for many large and small genome assemblies (numerous bacteria, Atlantic cod, bonobo, tomato, to name a few). Recently, Newbler has added support for using multiple sequencing technologies, making it one of the few hybrid assembly programs available. At the Advances in Genome Biology and Technology (AGBT) in 2013, Roche announced having used the Newbler program with a hybrid 454 and Illumina dataset to improve upon the human genome.

However, the Newbler program is not open source. Luckily, researchers only need to fill out an online form to get a free copy of the software. Still, this has hampered the wide-spread adoption of this program. Newbler, for example, was not included in assembly evaluations like GAGE and GAGE-B. That Roche/454 does not want to make the source code for Newbler available is partly understandable from a commercial standpoint: at least one competitor technology (Life Tech/Ion Torrent) with a similar sequencing error-model could benefit from access to the code. In fact, in a blog post, I showed Newbler to be superior to an open-source program when assembling Ion Torrent mate-pair data.

More worrying is that the hundreds of projects that used Newbler as part of the analysis are fundamentally irreproducible without the source code for each of the different versions. This is especially the case for projects, such as the Atlantic cod genome project, that have been given access to development versions of the code, incorporating elements not available to the general community.

Last October, Roche announced it will shut down its 454 sequencing business in mid-2016. Whatever one may feel about this decision, this further strengthens the argument for Roche/454 to make the Newbler source code open source. After the 454 shutdown, Newbler is otherwise likely to disappear too, meaning that large swathes of the literature cannot be recapitulated from the raw data. Also, long after the 454 shutdown, many researchers will have to process their 454 sequencing data, and many may still want to rely on Newbler for that purpose.

There are several other reasons why I feel the research community should be given access to the source code of Newbler. Newbler represents a very valuable contribution to the field of genome assembly and mapping. Software developers can learn from the algorithms and implementations of the Newbler code, opening up for reusing these in other programs. Also, there is the hope that developers will improve upon the program, for example by adding support for other sequencing technologies, or assembling with reads longer than the current maximum of 2 kbp.

So I hereby ask the readers of this blog for help: I have set up an online petition asking for Roche/454 to make the Newbler source code available at the latest at the time of the 454 shutdown. Please sign the petition here. Additionally, spread the word (e.g., on twitter or your own blog). Thanks in advance!

I intend to hand over the results of the petition to a Roche representative at the Advances in Genome Biology and Technology (AGBT) meeting (February 12-15, 2014).

(Thanks to Nick Loman for his constructive comments on an earlier version of this post)

## Methods

### Sample preparation

A medusa Nemopilema nomurai was collected at the Tongyeong Marine Science Station, KIOST (34.7699 N, 128.3828 E) on Sept. 12, 2013. The Sanderia malayensis samples were obtained from Aqua Planet Jeju Hanwha (Seogwipo, Korea) for transcriptome analyses of developmental stages since Nemopilema cannot be easily grown in the laboratory. The DNA and RNA preparation of Nemopilema and Sanderia are described in Additional file 1: Section 1.1. Species identification of Nemopilema was confirmed by comparing the MT-COI gene of five species of jellyfish. We aligned Nemopilema Illumina short reads (400 bp insert-size) to the MT-COI gene of Chrysaora quinquecirrha (NC_020459.1), Cassiopea frondosa (NC_016466.1), Craspedacusta sowerbyi (NC_018537.1), and Aurelia aurita (NC_008446.1) jellyfish with BWA-MEM aligner [40]. Consensus sequences for each jellyfish were generated using SAMtools [41]. The consensus sequence from C. sowerbyi was excluded due to low coverage. We conducted multiple sequence alignment using MUSCLE [42] and ran the MEGA v7 [43] neighbor joining phylogenetic tree (gamma distribution) with 1000 bootstrap replicates. Mitochondrial DNA phylogenetic analyses confirmed the identification of the Nemopilema sample as Nemopilema nomurai.

### Genome sequencing and scaffold assembly

For the de novo assembly of Nemopilema, PacBio SMRT and five Illumina DNA libraries with various insert sizes (400 bp, 5 Kb, 10 Kb, 15 Kb, and 20 Kb) were constructed according to the manufacturers’ protocols. The Illumina libraries were sequenced using a HiSeq2500 with a read length of 100 bp (400 bp, 15 Kb, and 20 Kb) and a HiSeq2000 with a read length of 101 bp (5 Kb and 10 Kb). Quality filtered PacBio subreads were assembled into distinct contigs using the FALCON assembler [44] with various read length cutoffs. To extend contigs to scaffolds, we aligned the Illumina long mate-pair libraries (5 Kb, 10 Kb, 15 Kb, and 20 Kb) to contig sets and extended the contigs using SSPACE [45]. Gaps generated by SSPACE were filled by aligning the Illumina short-insert paired-end sequences using GapCloser [46]. We also generated TSLRs using an Illumina HiSeq2000, which were aligned to scaffolds to correct erroneous sequences and to close gaps using an in-house script. Detailed genome sequencing and assembly process are provided in Additional file 1: Section 2.2.

### Genome annotation

The jellyfish genome was annotated for protein-coding genes and repetitive elements. We predicted protein-coding genes using a two-step process, with both homology- and evidence-based prediction. Protein sequences of the sea anemone, hydra, sponge, human, mouse, and fruit fly from the NCBI database and Cnidaria protein sequences from the NCBI Entrez protein database were used for homology-based gene prediction. Two tissue transcriptomes from Nemopilema were used for evidence-based gene prediction via AUGUSTUS [47]. Final Nemopilema protein-coding genes were determined using AUGUSTUS with exon (from the homology-based gene prediction) and intron (from the evidence-based gene prediction) hints. Repetitive elements were also predicted using Tandem Repeats Finder [48] and RepeatMasker [49]. Details of the annotation process are provided in Additional file 1: Sections 3.1 and 3.2.

### Gene age estimation

Phylostratigraphy employs BLASTP-scored sequence similarity to estimate the minimal age of every protein-coding gene. The protein sequence is used to query the NCBI non-redundant database and detect the most distant species in which a sufficiently similar sequence is present, inferring that the gene is at least as old as the common ancestor [50]. For every species, we use the NCBI taxonomy. The timing of most divergence events is estimated using TimeTree [51] and the Encyclopedia of Life [52]. To facilitate detection of sequence similarity, we use an E-value threshold of 10⁻³. We evaluate the age of all proteins whose length is equal to or greater than 40 amino acids. We count the number of genes in each phylostratum, from the most ancient (PS 1) to the newest (PS 11). To see broad evolutionary patterns, we aggregate the counts from several phylostrata into three broad evolutionary eras: ancient (PS 1–5, cellular organisms to Eumetazoa, 4204 Mya to 741 Mya), middle (PS 6–7, Cnidaria to Scyphozoa, 741 Mya to 239 Mya), and young (PS 8–11, Rhizostomeae to Nemopilema nomurai, 239 Mya to present).
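The aggregation into eras amounts to summing per-phylostratum gene counts over the three ranges given above. A minimal sketch follows; the function name is an assumption, and the counts in the example are placeholders for illustration, not results from the study.

```python
# Phylostratum ranges for each era, as stated in the text (PS 1-5, 6-7, 8-11)
ERAS = {
    "ancient": range(1, 6),
    "middle": range(6, 8),
    "young": range(8, 12),
}

def aggregate_by_era(counts_per_stratum):
    """Sum gene counts per phylostratum (keys 1..11) into the three eras."""
    return {
        era: sum(counts_per_stratum.get(ps, 0) for ps in strata)
        for era, strata in ERAS.items()
    }

# Placeholder counts, purely illustrative
example_counts = {1: 10, 5: 2, 6: 3, 11: 4}
era_totals = aggregate_by_era(example_counts)
```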

### Comparative evolutionary analyses

Orthologous gene clusters were constructed to examine the conservation of gene repertoires among the genomes of the Nemopilema nomurai, Aurelia aurita, Hydra vulgaris, Clytia hemisphaerica, Acropora digitifera, Nematostella vectensis, Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Homo sapiens, Trichoplax adhaerens, Amphimedon queenslandica, Mnemiopsis leidyi, and Monosiga brevicollis using OrthoMCL [53]. To infer a phylogeny and divergence times, we used RAxML [54] and MEGA7 program [43], respectively. A gene family expansion and contraction analysis was conducted using the Café program [55]. Domain regions were predicted by InterProScan [56] with domain databases. Details of the comparative analysis are provided in Additional file 1: Sections 4.1–4.3.

### Transcriptome sequencing and expression profiling

Illumina RNA libraries from Nemopilema nomurai and Sanderia malayensis were sequenced using a HiSeq2500 with 100-bp read lengths. Since there is not a reference genome for S. malayensis, we de novo assembled a pooled six RNA-seq read set using the Trinity assembler [57]. Quality filtered RNA reads from Nemopilema and Sanderia were aligned to the Nemopilema genome assembly and the assembled transcripts, respectively, using the TopHat [58] program. Expression values were calculated by the Fragments Per Kilobase Of Exon Per Million Fragments Mapped (FPKM) method using Cufflinks [58], and differentially expressed genes were identified by DEGseq [59]. Details of the transcriptome analysis are presented in Additional file 1: Sections 5.2 and 7.1.
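For reference, the basic FPKM normalization can be written out as below. This is only the textbook formula under the assumption of a known fragment count per gene; Cufflinks estimates abundances with a statistical model, so this sketch is not a reimplementation of it. The function name and example numbers are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    kilobases = exon_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)

# Hypothetical gene: 500 fragments, 2,000 bp of exon, 10 million mapped fragments
value = fpkm(500, 2_000, 10_000_000)
```

Dividing by both exon length and library size makes values comparable across genes of different lengths and across samples sequenced to different depths.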

### Hox and ParaHox analyses

We examined the homeodomain regions in Nemopilema using the InterProScan program. Hox and ParaHox genes were identified in Nemopilema by aligning the homeodomain sequences of human and fruit fly to the identified Nemopilema homeodomains. We considered only domains that were aligned to both the human and fruit fly. We also used this process for Acropora, Hydra, and Nematostella for comparison. Additionally, we added one Hox gene for Acropora and two Hox genes for Hydra, which are absent in the NCBI gene set, though they were present in previous studies [23, 60]. Hox and ParaHox genes of Clytia hemisphaerica, a hydrozoan species with a medusa stage, were also added based on a previous study [61]. Finally, a multiple sequence alignment of these domains was conducted using MUSCLE, and a FastTree [62] maximum likelihood phylogeny was generated using the Jones–Taylor–Thornton (JTT) model with gamma option.

### Wnt gene subfamily analyses

Wnt genes of Nematostella and Hydra were downloaded from previous studies [25, 63], and those of Acropora were downloaded from the NCBI database. Wnt genes in Nemopilema and Aurelia were identified using the Pfam database by searching for “wnt family” domains. A multiple sequence alignment of Wnt genes was conducted using MUSCLE, and aligned sequences were trimmed using the trimAl program [64] with “gappyout” option. A phylogenetic tree was generated using RAxML with the PROTGAMMAJTT model and 100 bootstraps.

## Methods

### Plant materials

The TDr96_F1 line used for WGS was selected from F1 progeny obtained from an open-pollinated D. rotundata breeding line (TDr96/00629) grown under field conditions in the experimental fields of the International Institute of Tropical Agriculture (IITA) in Nigeria. F1 seeds from TDr96/00629 and those obtained from the cross between the parental lines TDr97/00917 and TDr99/02627 used for RAD-seq were germinated on wet paper towels in darkness at 28 °C. After germination, the seeds were transferred to soil (Sakata Supermix A [34]) and grown at 30 °C with a 16-h/8-h photoperiod in a greenhouse at Iwate Biotechnology Research Institute (IBRC) in Japan. Fresh leaf samples were collected for DNA extraction. Additionally, to resequence the F1 progeny used for QTL-seq analysis, lyophilized leaf samples obtained from plants that were grown and phenotyped under field conditions at IITA were used for DNA extraction.

### Determination of chromosome number and ploidy level

For chromosome observation, root tips of TDr96_F1 plants generated by in vitro propagation of nodal explants were sampled and fixed in acetic acid-alcohol (1:3 ratio) for 24 h without pretreatment. Fixed root tips were stained with a 1% aceto-carmine solution for 24 h. Samples were prepared by the squash method and analyzed under an Olympus BX50 optical microscope (Olympus Optical Co, Ltd., Tokyo, Japan [35]) at 400× magnification.

### Estimation of D. rotundata genome size

The genome size of TDr96_F1 (D. rotundata) was estimated both by FCM and k-mer analyses. FCM analysis was carried out using nuclei prepared from fresh leaf samples of TDr96_F1 and a japonica rice (Oryza sativa L.) cultivar of known genome size (380 Mb [36]), which served as an internal reference standard. Nuclei were isolated and stained with propidium iodide (PI) simultaneously and analyzed using a Cell Lab Quanta™ SC Flow Cytometer (Beckman Coulter, CA [37]) following the manufacturer’s protocol. The ratio of G1 peak means [yam (281.7):rice (188.7) = 1.493] was used to estimate the genome size of D. rotundata to be 570 Mb (380 Mb × 1.5). k-mer analysis-based genome size estimation [10] was performed with TDr96_F1 PE reads with an average size of 230 bp and a total length of 16.77 Gb (16,771,579,510 bp) using ALLPATHS-LG [11]. k-mer frequency analysis, with the k-mer size set to 25, generated values for k-mer coverage (Kc = 25.66) and mean read length (Rl = 228.8), which were used to estimate the genome size of TDr96_F1 to 579 Mb as follows:

### Whole-genome sequencing

For WGS, genomic DNA was extracted from fresh TDr96_F1 leaf samples using a NucleoSpin Plant II Kit according to the manufacturer’s protocol (Macherey-Nagel GmbH & Co. KG [38]) with slight modifications. Homogenized samples were washed with 0.1 M 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES) buffer to remove contaminating polysaccharides. Just before use, 120 mg polyvinylpyrrolidone (PVP), 90 mg L-ascorbic acid, and 200 μl 2-mercaptoethanol (2-ME) were added to 10 ml HEPES buffer, and 1 ml of the mixture was used to wash each sample; washing was repeated three times. Additionally, 10 μl 2-ME and 5 μl of 30% polyethylene glycol (PEG)-20000 were added to 1 ml of PL1 buffer (provided with the NucleoSpin Plant II Kit), and twice the recommended volume of buffer (800 μl) was used for cell lysis. Libraries for PE short reads and MP jump reads of various insert sizes including 2, 3, 4, 5, 6, and 8 kb were constructed using an Illumina TruSeq DNA LT Sample Prep Kit and a Nextera Mate Pair Sample Prep Kit, respectively. The PE library was sequenced on the Illumina MiSeq platform, while the MP libraries were sequenced on the HiSeq 2500 platform. Library construction and sequencing of the 20- and 40-kb MP jump sequences were carried out by Eurofins Genomics (Operon [39]) and Lucigen [40], respectively. The 20-kb and 40-kb jump libraries were sequenced on the MiSeq and HiSeq 2500 platforms, respectively. BAC libraries were constructed by Lucigen, and BAC-end sequencing was carried out by Genaris [41] using Sanger sequencing. A total of 30,750 clones corresponding to 3072 Mb of sequence and 5.4× genome coverage were constructed. Of these, 9984 clones were used for BAC-end sequencing, generating a 13.6-Mb sequence in PE fasta format, which was converted to 50-bp PE short reads corresponding to a 0.46-Gb sequence and 0.8× coverage of the estimated 570-Mb D. rotundata genome (Additional file 1: Table S2 and Additional file 2: Figure S2).

### Constructing organelle genome sequences

De novo assembly of the D. rotundata mitochondrial genome sequence was performed using mitochondrial DNA isolated from TDr96_F1 leaf samples according to the method of Terachi and Tsunewaki [43] with the following minor modifications. Fresh green leaves (ca. 150 g) were homogenized in 1.5 L of homogenization buffer containing 0.44 M mannitol, 50 mM Tris-HCl (pH 8.0), 3 mM ethylenediaminetetraacetic acid (EDTA), 5 mM 2-ME, 0.1% (w/v) bovine serum albumin, and 0.1% (w/v) PVP. Following DNaseI treatment, the mitochondrial fraction was collected from the interface between 1.30 M and 1.45 M of a sucrose gradient. Mitochondrial DNA was purified by EtBr/CsCl centrifugation at 80,000 rpm for 6 h at 20 °C in a Beckman TLA 100.3 rotor. The DNA band was collected and purified by ethanol precipitation. The resulting mitochondrial DNA (15 ng) was amplified using a REPLI-g Mini Kit (Qiagen, Cat. no. 150023) and used for library construction. The library was sequenced on an Illumina MiSeq sequencer, and the resulting PE reads were assembled de novo using DISCOVAR De Novo [29], generating D. rotundata mitochondria contigs. For scaffolding, MP reads with insert sizes of 2, 3, 4, 5, 6, 8, and 20 kb obtained from D. rotundata genomic DNA (gDNA) were aligned to the D. rotundata mitochondrial contigs. MP reads showing 100% alignment were selected and used for scaffolding of D. rotundata mitochondrial contigs by SSPACE [13] (Additional file 2: Figure S5). To reconstruct the D. rotundata chloroplast genome sequence, the PE reads of TDr96_F1 were aligned to the recently published D. rotundata chloroplast genome sequence [16] (GenBank ID = NC_024170.1) by Burrows-Wheeler alignment (BWA) [44], and chloroplast-derived sequences were identified, amounting to 5,403,420 reads (14.74% of the total size of PE reads generated for TDr96_F1 [Table 1]) matching the assembled 155.4-kb chloroplast genome of D. rotundata.

### Evaluation of the completeness of the genomic assembly

To evaluate the completeness of the D. rotundata genome assembly, the assembly was checked for the presence of 248 highly conserved core eukaryotic genes [45] using CEGMA version 2.4 with default parameters [14] (Additional file 1: Table S4). To further assess the completeness of the genome, the successor to CEGMA, Benchmarking Universal Single-Copy Orthologs (BUSCO), was used to check for the presence of 956 BUSCOs with version 1.1.b1 [15] using the early access plant dataset (Additional file 1: Table S5).

### Annotation of transposable elements (TEs)

Legacy repetitive sequences, including transposons, were predicted using CENSOR 4.2.29 [46] with the following options: show_simple, nofilter, and mode rough using the Munich Information Center for Protein Sequences (MIPS) Repeat Element Database [47]. Following identification, the repeat elements were classified using mips-REcat [47]. Repetitive sequences were later improved by remodeling using RepeatModeler 1.0.8 [48] and masked with RepeatMasker 4.0.5 [49]. Using the National Center for Biotechnology Information (NCBI) database, one of three other options was used to generate interspersed RepeatModeler-based, interspersed Rebase-based, and Low complexity repeats: “nolow”, “nolow, species Viridiplantae”, and “noint”, respectively. Repeat element content and other statistics were compared between the D. rotundata and A. thaliana TAIR10 [50], B. distachyon v3.1 [51], and O. sativa v7_JGI 323 [52] genomes using the RepeatModeled and RepeatMasked references (Table 1).

### RNA-seq

Total RNA was extracted using leaf, stem, flower, and tuber samples collected from a greenhouse-grown TDr96_F1 plant using a Plant RNeasy Kit (Qiagen [53]) with slight modifications. RLC buffer was used for lysis after the addition of 5 μl 30% PEG-20000 and 10 μl 2-ME to 1 ml of buffer. The RNA samples were treated with DNase (Qiagen) to remove contaminating genomic DNA. Two micrograms of total RNA was used to construct complementary DNA (cDNA) libraries using a TruSeq RNA Sample Prep Kit V2 (Illumina) according to the manufacturer’s instructions. The libraries were used for PE sequencing using 2× 100 cycles on the HiSeq 2500 platform in high-output mode. Illumina sequencing reads were filtered by Phred quality score, and reads with a quality score of ≥ 30 (≥90% of reads) were retained (Additional file 1: Table S12). Only one RNA-seq experiment was carried out per tissue/organ (indicated as sample in Additional file 1: Table S12).

### Prediction of protein-coding genes

The legacy gene models were generated previously using the legacy repeat-masked reference genome and three approaches: ab initio, ab initio supported by evidence-based prediction, and evidence-based prediction. The ab initio prediction was carried out with FGENESH 3.1.1 [54]. The ab initio supported by evidence-based prediction was performed with AUGUSTUS 3.0.3 [55] using the maize5 training set and a hint file as the gene model support information. To construct the hint file, TopHat 2.0.11 [56] was used to align RNA-seq reads from tuber, flower (young), leaf (young), stem, leaf (old), and flower (old) samples to the D. rotundata reference genome, and Cufflinks 2.2.1 [57] was used to generate gene models from these data. The evidence-based predictions using the Program to Assemble Spliced Alignments (PASA) [58] were generated in a Trinity [59] assembled transcriptome from the RNA-seq data. JIGSAW 3.2.9 [60] was used to select and combine the gene models obtained using the three approaches with the weighting values assigned to the results from FGENESH, AUGUSTUS, and PASA of 10, 3, and 3, respectively. In total, 21,882 consensus gene models were predicted. These gene models were further improved upon using the MAKER [61] pipeline (Additional file 2: Figure S14). Publicly available ESTs and protein sequences from related plant species were aligned to the genome using GMAP [62] and Exonerate 2.2.0 [63], respectively. De novo and reference-guided transcripts were assembled from RNA-seq data from all 18 tissues using Bowtie 1.1.1 [64], Trinity 2.0.6 and SAMtools 1.2.0 [65], and Trinity 2.0.6 and TopHat 2.1.0, respectively. Both sets of assembled transcripts were used to build a comprehensive transcript database using PASA (Additional file 1: Table S13). High-quality non-redundant transcripts from PASA were used to generate a training set for AUGUSTUS 3.1. 
Gene models were predicted twice using the genome, improved repeat sequences, assembled transcripts, EST and protein alignments, the AUGUSTUS training set, and a legacy set of 21,882 gene models obtained previously using MAKER 2.31.6 [61], retaining all legacy gene models or querying them with new evidence and discarding those that could not be validated. From both MAKER runs, 21,894 and 76,449 gene models were predicted, respectively. A consensus set of gene models from both MAKER outputs was obtained using JIGSAW 3.2.9 [60] at a 1:1 ratio. In total, 26,198 consensus gene models were predicted in the D. rotundata genome. The corresponding amino acid sequences were also predicted for these gene models. To confirm these gene models, the RNA-seq reads were aligned to the CDSs (coding sequences) of the predicted genes using BWA [44] with default parameters. Accordingly, 85.8% of the gene models could be aligned by at least a single RNA-seq read. Functional annotation of the amino acid sequences was performed using the in-house pipeline, AnnotF, which compares Blast2GO [66] and InterProScan [67] functional terms.

### Comparative genomics

Pairwise orthology relationships were determined with Inparanoid [68,69,70] using the longest protein-coding isoform for each gene in Arabidopsis thaliana (TAIR10) [50], Oryza sativa japonica (v7.0) [52], Brachypodium distachyon (v3.1) [71], Musa acuminata (v2) [72], Elaeis guineensis (EG5) [73], and Phoenix dactylifera (DPV01) [74]. Orthology clusters across all seven species were determined using Multiparanoid [75]. Sequences for the 12 classes of lectins were obtained from UniProt [76] for the proteomes of A. thaliana (up000006548), B. distachyon (up000008810), and O. sativa (up000059680). Protein alignments for B-lectin class protein sequences from all three of these species and D. rotundata were generated using the program Multiple Alignment using Fast Fourier Transform (MAFFT) [77]. Maximum likelihood trees were constructed based on the concatenated alignments of all 378 B-lectin proteins using RAxML [78] 8.0.2 with 1000 bootstraps. Enrichment of tuber-specific genes was detected using TopHat 2.1.0 to align RNA-seq data from each of the 12 tissues to the genome, with one biological replicate for each tissue. HTSeq 0.6.1 [79] was used to generate raw counts. Then the Bioconductor package DESeq2 1.14.1 [80] was used to compare raw counts of the three tuber tissues against all the other nine tissues (Additional file 1: Table S12) to determine tuber-enriched gene expression based on a log2 fold change > 0 and Benjamini–Hochberg [25] adjusted P value < 0.05.
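The tuber-enrichment filter described above (log2 fold change > 0 and Benjamini–Hochberg adjusted P < 0.05) can be illustrated with a minimal Python sketch. DESeq2 performs this adjustment internally; the functions `benjamini_hochberg` and `tuber_enriched` and the toy inputs are hypothetical and only show the logic of the filter, not the authors' pipeline.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def tuber_enriched(genes, log2_fold_changes, p_values, alpha=0.05):
    """Keep genes with positive log2 fold change and adjusted p below alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [
        gene
        for gene, lfc, q in zip(genes, log2_fold_changes, adjusted)
        if lfc > 0 and q < alpha
    ]
```

With raw P values of 0.01, 0.04, 0.03, and 0.5, the adjusted values are 0.04, 0.0533, 0.0533, and 0.5, so only the first gene would pass the 0.05 cut-off if its fold change is positive.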

Gene enrichment analysis of orthology clusters was performed with GOATOOLS [81], using the Holm significance test, and the false discovery rate was adjusted using the Benjamini–Hochberg procedure [25]. The list of enriched genes was filtered for redundant Gene Ontology (GO) terms using REVIGO [82]. For the species phylogeny, protein alignments for each gene with a 1:1 orthologous relationship across all monocot species were generated with MAFFT using the longest protein isoform. Maximum likelihood trees were constructed based on the concatenated alignments of 2381 orthologous protein-coding genes using RAxML 8.2.8 [78] with a JTT + Γ model and 1000 bootstraps.

SynMap [83] using BLASTZ [84] alignments, DAGchainer [85] (options -D 30 and -A 2), and no merging of syntenic blocks were used as part of the CoGe platform [86] to identify syntenic blocks between the hard-masked pseudo-chromosomes of D. rotundata and scaffolds/contigs of Oryza sativa japonica (A123v1.0), Spirodela polyrhiza (v0.01), and Phoenix dactylifera L. (v3). A syntenic path assembly was then carried out on each of the same three species in SynMap using synteny between the scaffolds/contigs against D. rotundata pseudo-molecules. The syntenic path assembly is a reference-guided assembly that uses the synteny between two species to order and orient contigs. This approach highlights regions of conservation that were otherwise too shuffled to be clearly observed. Self-self synteny analysis of D. rotundata pseudo-chromosomes was carried out using SynMap Last alignments with default parameters and syntenic gene pair synonymous rate change calculated by CodeML [87].

RAD-seq was performed as previously described [88] with a minor modification. Genomic DNA was digested with the restriction enzymes PacI and NlaIII to prepare libraries used to generate PE reads by Illumina HiSeq 2500 (Additional file 2: Figure S6). Approximately 822.7-Mb and 250.4-Mb sequence reads covering 22.9% and 5.3% of the estimated 504-Mb D. rotundata genome sequence, excluding gap regions, at average depths of 7.2× and 9.8× were generated for the parental lines and F1 individuals, respectively (Additional file 2: Figure S7).

#### Library preparation and sequencing

For library construction, 1 μg DNA obtained from the two parental lines (TDr97/00917-P1 and TDr99/02627-P2) and the 150 F1 individuals was digested with PacI, which recognizes 5’-TTAATTAA-3’, and a biotinylated adapter-1 was ligated to the digested DNA fragments. The adapter-1-ligated DNA fragments were digested with a second enzyme, NlaIII (5’-CATG-3’). After collecting the biotinylated fragments using streptavidin-coated magnetic beads, adapter-2 was ligated to the NlaIII-digested ends. The adapter-ligated DNA was amplified using primers containing sample-specific index sequences, adapter-1 (F) and adapter-2 (R) sequences, and sequences corresponding to the P7 and P5 primers for Illumina sequencing library preparation (Additional file 2: Figure S6). The PCR products were pooled in equal proportions, purified, and subjected to PE sequencing on the Illumina HiSeq 2500 platform. Detailed information about the primers (P7 and P5) used for Illumina library preparation is given in Additional file 1: Table S20.

#### Identification of parental line-specific heterozygous markers

RAD-tags were aligned to the D. rotundata reference genome using BWA [44]. The aligned data were converted to SAM/BAM files using SAMtools [65], and the RAD-tags with mapping quality < 60 or containing insertions/deletions in the alignment data were excluded from analysis. Low mapping positions including those with only a single RAD-tag and a mapping quality score of < 30 were also excluded. SNP-index values [28] were calculated at all SNP positions. For linkage mapping, two types of heterozygous markers (SNP-type and presence/absence-type) were identified (Additional file 2: Figure S8). The SNP-type heterozygous markers were defined based on SNP-index patterns of the parental line RAD-tags. For example, positions with SNP-index values ranging from 0.2 to 0.8 in P1 but homozygous in P2 with SNP-index values of either 0 or 1 were defined as P1-specific heterozygous SNPs. A similar procedure was followed to identify P2-specific heterozygous SNP markers. The selected markers were filtered using depth information at all positions. To increase the accuracy of the selected markers, their segregation (1:1 ratio) was confirmed in 150 F1 individuals obtained from a cross between P1 and P2. If the segregation ratio was out of the confidence interval (P < 0.05) hypothesized by the binomial distribution, B(n = number of individuals, P = 0.5), the markers were excluded from further analysis. Only one marker was selected per 10-kb interval based on the number of F1 individuals represented and tag coverage. A total of 1105 and 990 P1- and P2-heterozygous SNP markers were selected, respectively (Additional file 1: Table S7).
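The two filters applied to SNP-type markers, the parental SNP-index pattern and the 1:1 segregation check, can be sketched in Python as follows. The function names are hypothetical, and the exact two-sided binomial test shown here is one plausible reading of "out of the confidence interval (P < 0.05)"; the source does not state how the interval was computed.

```python
from math import comb


def is_p1_specific_het(snp_index_p1, snp_index_p2):
    """P1 heterozygous (index 0.2 to 0.8) and P2 homozygous (index 0 or 1)."""
    return 0.2 <= snp_index_p1 <= 0.8 and snp_index_p2 in (0.0, 1.0)


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small relative tolerance so symmetric outcomes count as ties.
    cutoff = pmf[k] * (1 + 1e-9)
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def segregates_1_to_1(n_het, n_total, alpha=0.05):
    """Assumed rule: keep the marker unless B(n_total, 0.5) rejects a 1:1 ratio."""
    return binomial_two_sided_p(n_het, n_total) >= alpha
```

For example, a marker that is heterozygous in 75 of 150 F1 individuals passes the check, whereas one that is heterozygous in only 30 of 150 is rejected.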

The presence/absence-type markers were defined based on the alignment depth of parental line RAD-tags. First, genomic positions that could be aligned by RAD-tags from only one of the parental lines were identified. Additionally, aligned tags should be heterozygous for that particular region. Similar to the SNP-type markers, the segregation patterns of candidate presence/absence-type markers in the F1 progeny were confirmed, and only those that segregated at a 1:1 ratio (as confirmed by binomial distribution filter) were retained. In the F1 progeny, positions with sequencing depths of ≥ 3 and 0 were defined as heterozygous and homozygous, respectively. For a given candidate position/marker, if the number of F1 individuals defined as homozygous or heterozygous was less than 120, the marker was excluded from further analysis. Only one heterozygous position was selected as a marker within a given 10-kb interval. In total, 221 and 282 positions were selected as P1- and P2-specific presence/absence-type heterozygous markers, respectively (Additional file 1: Table S7).
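The depth-based genotype calls for presence/absence-type markers can be sketched in Python as below. The function names are hypothetical, and treating depths of 1 or 2 as uncalled is an assumption drawn from the gap between the two stated thresholds (≥ 3 and 0).

```python
def call_presence_absence(depths, min_het_depth=3):
    """Call each F1 individual from its depth at a candidate position.

    depth >= 3 -> "het", depth == 0 -> "hom", anything else -> None (uncalled).
    """
    calls = []
    for depth in depths:
        if depth >= min_het_depth:
            calls.append("het")
        elif depth == 0:
            calls.append("hom")
        else:
            calls.append(None)
    return calls


def enough_called(calls, min_called=120):
    """Markers with fewer than 120 called individuals are excluded."""
    return sum(call is not None for call in calls) >= min_called
```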

To develop parental line-specific linkage maps, P1-Map and P2-Map, recombination fraction (rf) values between all pairs of markers on a given scaffold were calculated for both parents using the recombination pattern of the 150 F1 individuals. To minimize incorrect mapping, scaffolds were divided at positions where rf values exceeded 0.25 from the initial marker position (Additional file 2: Figure S9). Only two flanking (distal) markers per scaffold were selected, corresponding to 477 and 493 P1- and P2-specific markers, respectively. These markers were used to develop P1 and P2 linkage maps according to the pseudo-testcross method [18] using the backcross model of R/qtl [89]. Due to the use of the pseudo-testcross method, the initial maps contained both the coupling and repulsion-type markers. Consequently, the genetic distance in linkage groups was larger than expected. To avoid the effect of repulsion-type markers when calculating genetic distances, these markers were converted to coupling-type markers. If a marker showed a high logarithm of odds (LOD) score and an rf value > 0.5, it was defined as repulsion type and was therefore converted to the coupling-type genotype. This conversion was carried out gradually by changing the threshold of the LOD score from 10 to 5, and then to 3. After converting all repulsion markers to coupling markers, linkage maps were developed using markers showing LOD score > 3 and rf value < 0.25. Accordingly, a total of 21 and 23 linkage groups, each with a minimum of three markers, were generated for P1- and P2-Maps, respectively (Additional file 1: Table S8 and Additional file 2: Figure S10).
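The repulsion-to-coupling conversion can be illustrated with a small Python sketch. It assumes genotypes coded as 0/1 (with `None` for missing data) and uses the standard backcross LOD score against the null rf = 0.5; it is not the R/qtl implementation, and the function names are hypothetical.

```python
from math import log10


def recombination_fraction(g1, g2):
    """Return (rf, recombinants, n) over individuals genotyped at both markers."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    recombinants = sum(a != b for a, b in pairs)
    return recombinants / len(pairs), recombinants, len(pairs)


def lod_score(recombinants, n):
    """log10 likelihood ratio of the estimated rf versus rf = 0.5 (backcross)."""
    rf = recombinants / n

    def xlog(count, prob):
        # Treat 0 * log10(0) as 0.
        return 0.0 if count == 0 else count * log10(prob)

    return n * log10(2) + xlog(recombinants, rf) + xlog(n - recombinants, 1 - rf)


def to_coupling(reference, marker, lod_threshold=3.0):
    """Flip a marker's genotypes if it looks repulsion-type relative to reference."""
    rf, recombinants, n = recombination_fraction(reference, marker)
    if rf > 0.5 and lod_score(recombinants, n) > lod_threshold:
        return [None if g is None else 1 - g for g in marker]
    return list(marker)
```

Running `to_coupling` with thresholds of 10, 5, and then 3 would mimic the gradual conversion described above; after flipping, a repulsion-type marker with rf = 0.9 has rf = 0.1 against the same reference marker.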

#### Anchoring scaffolds

To develop chromosome-scale pseudo-molecules, TDr96_F1 scaffold sequences were anchored onto the two parental-specific linkage maps using the selected RAD markers. To combine the two maps, the number of scaffolds shared between all possible linkage group (LG) pairs corresponding to the two maps was determined (Additional file 2: Figures S11, S12). LG pairs that shared the largest number of scaffolds were combined using the same scaffolds. Each combined LG represented a pseudo-chromosome, which was designated/numbered according to the P1-Map LG designation (see Fig. 2 and Additional file 2: Figure S11). After combining the two maps to construct the pseudo-chromosomes, P1- and P2-specific scaffolds were ordered according to their original order in their respective LGs. If the order of scaffolds could not be decided because the order was similar in both the P1- and P2-Maps, the order in P1-LG was adopted (Fig. 2). Finally, the ordered scaffolds were connected by 1000 nucleotides of “N” into a single fasta file for each pseudo-chromosome (Additional file 2: Figure S12).
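The final joining step, ordered scaffolds connected by runs of 1000 “N”, can be sketched in Python as follows. The function name, the 60-column line wrapping, and the record name are illustrative assumptions; scaffold orientation is not handled here.

```python
def build_pseudochromosome(name, ordered_scaffolds, gap_length=1000, width=60):
    """Join already-ordered scaffold sequences with N gaps and return FASTA text."""
    sequence = ("N" * gap_length).join(ordered_scaffolds)
    lines = [f">{name}"]
    # Wrap the sequence to a fixed line width, as is conventional for FASTA.
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"
```

Joining three scaffolds this way yields a record whose length is the sum of the scaffold lengths plus 2000 gap bases.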

### QTL-seq analysis

DNA samples obtained from the two parental lines, TDr97/00917 (P3, female) and TDr97/00777 (P4, male), as well as samples pooled in equal amounts from 50 male (male-bulk) and 50 female (female-bulk) F1 individuals obtained from the cross between P3 and P4 were subjected to WGS. Libraries for sequencing were constructed from 1-μg DNA samples with a TruSeq DNA PCR-Free LT Sample Preparation Kit (Illumina) and were sequenced via 76 cycles on the Illumina NextSeq 500 platform. Short reads in which more than 20% of sequenced nucleotides exhibited a Phred quality score of < 20 were excluded from further analysis. To perform QTL-seq analysis of F1 progeny, two types of analyses are required. In the first analysis, the SNP index and ΔSNP index are calculated at P4-specific heterozygous positions. The second analysis is performed using P3-specific heterozygous positions. To identify P4-specific heterozygous positions, the P3 “reference sequence” was first developed by aligning P3 reads to the reference genome sequence of D. rotundata and replacing nucleotides of the D. rotundata reference genome sequence with nucleotides of P3 at all SNP positions showing an SNP index of 1 (Additional file 2: Figure S17c). SNP detection, calculation of SNP index, and replacement of SNPs were carried out via step 2 of QTL-seq pipeline version 1.4.4 [90]. Short reads obtained from both the male and female parents were then aligned to the “reference sequence” and heterozygous SNP positions between the two were extracted. A SNP was defined as heterozygous if the same position showed an SNP-index value ranging from 0.4 to 0.6 in one parent and a value of 0 in the second parent. Of the selected markers/positions, only those having enough depth in both parents (>15) were used for analysis of SNP-index values in the bulk-sequenced samples. P3-specific heterozygous positions were identified similarly using the P4 “reference sequence.”
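The rule for selecting parent-specific heterozygous positions can be written as a short Python sketch. It assumes the usual QTL-seq definition of the SNP index (reads carrying the non-reference allele divided by total depth); the function names and the tuple layout are hypothetical.

```python
def snp_index(alt_reads, depth):
    """Fraction of reads carrying the non-reference allele at a position."""
    return alt_reads / depth


def is_parent_specific_het(het_parent, other_parent, min_depth=15):
    """Each parent is given as (alt_read_count, depth).

    The position is kept when one parent has an SNP index of 0.4 to 0.6,
    the other has an SNP index of 0, and both depths exceed min_depth.
    """
    (alt_h, depth_h), (alt_o, depth_o) = het_parent, other_parent
    if depth_h <= min_depth or depth_o <= min_depth:
        return False
    return 0.4 <= snp_index(alt_h, depth_h) <= 0.6 and alt_o == 0
```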

After identifying P4- and P3-specific heterozygous positions, the Illumina reads from the two bulk-sequenced samples (male and female bulks) were aligned to the reference sequences using BWA [44] and subjected to Coval filtering [91] as previously described. When the P3 reference sequence was used for alignment, the SNP-index values were calculated only at all of the P4-specific heterozygous positions. By contrast, when the P4 reference sequence was used for alignment, the SNP-index values were calculated only at the P3-specific heterozygous positions. In both cases, positions with shallow depth (< 6) in either of the two samples were excluded from analysis. The ∆SNP index was calculated by subtracting the SNP-index values of the male bulk from those of the female bulk. To generate confidence intervals of the SNP-index value, an in silico test simulating the application of QTL-seq to DNA bulked from 50 randomly selected F1 individuals was performed as described previously [28] (Additional file 2: Figure S22). The simulation test was repeated 10,000 times depending on the alignment depth of short reads to generate confidence intervals. These intervals were plotted for all SNP positions analyzed. Finally, sliding window analysis was applied to SNP-index, ∆SNP-index, and confidence interval plots with a 1-Mb window size and a 50-kb increment to generate SNP-index graphs (Additional file 2: Figure S18).
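The ∆SNP-index calculation and the sliding-window averaging (1-Mb window, 50-kb increment) can be sketched in Python as follows. The function names are hypothetical, and the handling of windows at chromosome ends and of empty windows is an assumption, since the source does not specify it.

```python
def delta_snp_index(female_index, male_index):
    """Delta SNP index: female-bulk SNP index minus male-bulk SNP index."""
    return female_index - male_index


def sliding_window_mean(positions, values, chrom_length,
                        window=1_000_000, step=50_000):
    """Average values whose positions fall in [start, start + window).

    Returns (start, end, mean) for every window with at least one position.
    """
    results = []
    start = 0
    while start < chrom_length:
        end = start + window
        in_window = [v for p, v in zip(positions, values) if start <= p < end]
        if in_window:
            results.append((start, end, sum(in_window) / len(in_window)))
        start += step
    return results
```

This straightforward version rescans all positions for every window, which is adequate for illustration but would be slow on genome-scale data.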

### Identification of putative W-region by de novo assembly of female and male parental genomes and mapping of bulked DNA from female and male F1 progeny

DNA samples obtained from the two parental lines, TDr97/00917 (P3, female) and TDr97/00777 (P4, male), were separately subjected to de novo assembly. Libraries for sequencing were prepared with a TruSeq DNA PCR-Free LT Sample Preparation Kit (Illumina) and were sequenced for 251 cycles on the Illumina MiSeq platform. Contigs were generated using the DISCOVAR De Novo assembler [29], resulting in P3-DDN and P4-DDN, respectively. Separately, whole-genome resequencing was performed on bulked DNA samples obtained from 50 female F1 (Female-bulk.fastq) and 50 male F1 (Male-bulk.fastq) progeny, all derived from a cross between P3 and P4. The two reference sequences, P3-DDN and P4-DDN, were combined to generate P3-DDN/P4-DDN. Short reads from the female and male bulks were separately mapped to P3-DDN/P4-DDN using the alignment software BWA [44]. After mapping, the MAPQ scores of the aligned reads were obtained. Under our conditions, if a short read was mapped to a unique position of the reference sequence, the MAPQ score was 60, whereas if the read was mapped to multiple positions, MAPQ was < 60. Since two reference sequences (P3-DDN and P4-DDN) were fused to generate P3-DDN/P4-DDN, most genomic regions were represented twice. Therefore, most short reads mapped to two or more positions, leading to a MAPQ score < 60. The reads that mapped to P3-DDN/P4-DDN with MAPQ = 60 were judged to be located in either P3- or P4-specific genomic regions. After finding these P3- or P4-specific genomic regions, the depth of short reads that covered the regions for Female-bulk.fastq and Male-bulk.fastq, respectively, was evaluated. If the depth of Female-bulk.fastq was high and the depth of Male-bulk.fastq was 0 or close to 0, such genomic regions were retained as putative W-regions (Fig. 6 and Additional file 2: Figure S20).
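The selection logic for putative W-regions can be sketched in Python as below. The source gives no numeric cut-offs for "high" female depth or "close to 0" male depth, so `min_female_depth` and `max_male_depth` are placeholder assumptions, as are the function names and the tuple layout.

```python
def is_unique(mapq, unique_mapq=60):
    """Under the conditions described, MAPQ 60 marks a uniquely mapped read."""
    return mapq == unique_mapq


def putative_w_regions(regions, min_female_depth=10.0, max_male_depth=0.5):
    """Select regions covered in the female bulk but essentially absent in the male bulk.

    regions: iterable of (region_id, female_depth, male_depth), where depths are
    computed from uniquely mapped reads only. Both thresholds are illustrative.
    """
    return [
        region_id
        for region_id, female_depth, male_depth in regions
        if female_depth >= min_female_depth and male_depth <= max_male_depth
    ]
```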

### DNA markers linked to sex

The primer sequences used for amplification of sex-linked markers sp1 and sp16, as well as the control Actin gene fragment (Dr-Actin), were as follows:

## References

Wheeler DL, Tanya B, Benson DA, Bryant SH, Kathi C, Vyacheslav C, Church DM, Michael D, Ron E, Scott F, Michael F, Geer LY, Wolfgang H, Yuri K, Oleg K, David L, Lipman DJ, Madden TL, Maglott DR, Vadim M, James O, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008;36:D13–21.

Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60.

Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28(1):27–30.

Mitchell A, Chang H-Y, Daugherty L, Fraser M, Hunter S, Lopez R, McAnulla C, McMenamin C, Nuka G, Pesseat S, Sangrador-Vegas A, Scheremetjew M, Rato C, Yong SY, Bateman A, Punta M, Attwood TK, Sigrist CJA, Redaschi N, Rivoire C, Xenarios L, Kahn D, Guyot D, Bork P, Letunic I, Gough J, Oates M, Haft D, Huang H, Natale DA, Wu CH, Orengo C, Sillitoe I, Huaiyu M, Thomas PD, Finn RD. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 2015;43(D1):D213–21.

Huson DH, Beier S, Flade I, Górska A, El-Hadidi M, Mitra S, Ruscheweyh HJ, Rewati T. MEGAN Community Edition - interactive exploration and analysis of large-scale microbiome sequencing data. PLoS Comput Biol. 2016;12(6):e1004957. doi:10.1371/journal.pcbi.1004957.

Yu P, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.

Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010;17(11):1519–33.

Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung DW, Yiu SM, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam TW, Wang J. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18.

Wang Q, Fish JA, Gilman M, Sun Y, Brown CT, Tiedje JM, Cole JR. Xander: employing a novel method for efficient gene-targeted metagenomic assembly. Microbiome. 2015;3(1):1–13.

Shakya M, Quince C, Campbell JH, Yang ZK, Schadt CW, Podar M. Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities. Environ Microbiol. 2013;15(6):1882–99.

Myers EW. The fragment assembly string graph. Bioinformatics. 2005;21 suppl 2:ii79–85.

Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, Anson EL, Bolanos RA, Chou H-H, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC. A whole-genome assembly of Drosophila. Science. 2000;287:2196–204.

Sedgewick R, Wayne K. Algorithms. 4th ed. Addison-Wesley Professional; 2011.

Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, Edwards RA, Gerdes S, Parrello B, Shukla M, Vonstein V, Wattam AR, Xia F, Stevens R. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2013;42(Database issue):D206–14.

Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, Jensen LJ, von Mering C, Bork P. eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 2012;40(Database-Issue):284–9.

Wu D, Jospin G, Eisen JA. Systematic identification of gene families for use as “Markers” for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. PLoS One. 2013;8:10.

Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30(14):3059–66.

Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80.

Treangen T, Koren S, Sommer D, Liu B, Astrovskaya I, Ondov B, Darling A, Phillippy A, Pop M. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol. 2013;14(1):R2.

Howe AC, Jansson JK, Malfatti SA, Tringe SG, Tiedje JM, Brown CT. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci USA. 2014;111(13):4904–9.

## Results

### The rise of NGS and de novo assembler use in GenBank viral sequences

GenBank viral entries from 1982 to 2019 were collected and analyzed, with extensive analyses performed to evaluate technologies and bioinformatics programs cited in records deposited between 2011 and 2019. Through 2019, there were over 2.7 million viral entries in GenBank; however, over 70% (1.9 million) do not specify a sequencing technology (Supplement Table S1) due to the looser data requirements in earlier years. Among recently deposited records (2014–2019), Illumina was the most common NGS platform used for viral sequencing, with more than twice as many entries as the next most popular NGS platform (Fig. 1d & e). When long sequences (≥2000 nt) are considered, NGS technologies surpassed Sanger in 2017 as the dominant strategy for sequencing, comprising 53.8% (14,653/27,217) of entries compared to 46.2% of entries (12,564/27,217) for Sanger. This trend held true in 2018 and 2019 as well (Fig. 1f and Supplement Table S2).

Fig. 1: Trends and patterns of sequencing technology and assembly methods of viral entries in the GenBank database. (a) Cumulative frequency histogram of all viral entries in GenBank from Jan. 1, 1982 through Dec. 31, 2019 (total = 2,793,810 entries). (b) Count of all viral entries with at least one Sequencing Technology documented for the years 1982–2019. For panels (b) and (d), the “Other” category denotes entries with the Sequencing Technology field omitted or mis-assigned. (c) Relationship between viral entries listing one or two Sequencing Technologies during 1982–2019. The number inside the circle indicates viral entries with only one Sequencing Technology listed; the number adjacent to the line indicates entries combining two Sequencing Technologies. The thicker the connection line, the stronger the relationship. (d) and (e) Percentage ratio graph of all viral entries with Sequencing Technology documented for the years 2010–2019, with (d) and without (e) the Other category. The majority of entries in earlier years include omissions classified under the Other category, which is detailed in Supplement Table S1. (f) Percentage ratio graph of viral entries with length greater than 2000 nt that have been documented with one of the seven Sequencing Technologies for the years 2012–2019. The seven technologies include Sanger (n = 1) and NGS technologies (n = 6). (g) Percentage ratio graph of viral entries with length greater than 2000 nt that have been documented with one of the six NGS technologies as the Sequencing Technology for the years 2012–2019. Compared to panel (f), Sanger is excluded in this graph. (h) Assembly method of viral entries greater than 2000 nt, showing percentage ratio graph of entries with at least one Assembly Method. For (h) and (i), the Other category describes assembly methods outside of the 18 most popular programs investigated. (i) Reclassification of panel (h) by the nature of the assembly methods. The programs can be grouped into de novo assemblers, reference-mapping programs, and software that can perform both.

Hybrid sequencing approaches, where researchers use more than one sequencing technology to generate complete viral sequences, have also become more common over the past several years. The most common combination observed was 454 and Sanger (18,124 entries), likely due to the early emergence of the 454 technology compared to other NGS platforms (Fig. 1c and Supplement Table S3). However, combining Illumina with various other sequencing platforms is quite commonplace (> 19,000 entries).

De novo assembly programs (ABySS, BWA, Canu, Cap3, IDBA, MIRA, Newbler, SOAPdenovo, SPAdes, Trinity, and Velvet) have increased from less than 1% of viral sequence entries ≥2000 nt in 2012, to 20% of all viral sequence entries in 2019 (Fig. 1h & i). A similar increase was observed for reference-mapping programs (i.e., Bowtie and Bowtie2), from 0.03% in 2012 to 12.5% in 2019. Multifunctional programs that offer both assembly options were the most common programs cited for the years 2013–2019, but since the exact sequence assembly strategy used for these records is unknown (Tables S1-S5), the contributions of de novo assembly are likely underestimated. An expanded summary of the sequencing technologies and assembly approaches used for viral GenBank records is available in the Supplement text and Supplement Tables S1-S6.

### Effect of variant assembly using popular de novo assemblers

After establishing the growing use of NGS technologies for viral sequencing, we next focused on understanding how the presence of viral variants may influence de novo assembly output. We generated 247 simulated viral NGS datasets representing a continuum of pairwise identity (PID) between two viral variants, from 75% PID (one nucleotide difference every 4 nucleotides), to 99.6% PID (one nucleotide difference every 250 nucleotides) (Fig. 2). For Experiment 1, these datasets were assembled using 10 of the most used de novo assembly programs (Fig. 2 and Supplement Figure S1a) to evaluate their ability to assemble the two variants into their own respective contigs as the PID between the variants increases.
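The relationship between mutation spacing and PID quoted above can be checked with a short sketch. It assumes, purely for illustration, that the variant differs from the reference by one substitution at the end of every fixed-size block; the function names (`pid_from_spacing`, `mutate_every`, `percent_identity`) are hypothetical and are not taken from the study's pipeline, which may have placed mutations differently.

```python
import random


def pid_from_spacing(spacing: int) -> float:
    """Pairwise identity (%) when one site in every `spacing` nucleotides differs."""
    return 100.0 * (1.0 - 1.0 / spacing)


def mutate_every(reference: str, spacing: int, rng: random.Random) -> str:
    """Return a variant with one substitution per complete `spacing`-nt block.

    Assumption: the substitution is placed on the last base of each block.
    """
    variant = list(reference)
    for pos in range(spacing - 1, len(reference), spacing):
        # Pick any base other than the reference base so the site really differs.
        alternatives = [b for b in "ACGT" if b != reference[pos]]
        variant[pos] = rng.choice(alternatives)
    return "".join(variant)


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(1000))

# One mutation every 4 nt up to one every 250 nt gives 247 spacing values.
spacings = list(range(4, 251))
print(len(spacings))

variant = mutate_every(reference, 4, rng)
print(pid_from_spacing(4), percent_identity(reference, variant))
```

Under this assumption, spacings of 4 through 250 nucleotides yield exactly 247 datasets, matching the number reported, with endpoints at 75% and 99.6% PID.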

Workflow diagram of the investigation of variant simulated NGS reads through de novo assembly. First, in step 1, an artificial reference genome and corresponding initial variant reads were created with varying constraints such as genome length, GC content, read length, and assemblers, according to the experiment types as detailed in Supplement Figure S1. In the second step, an artificial mutated variant genome was created. The process was repeated to generate 247 different mutated variants with controlled mutation parameters, starting with 1 mutation every 4 nucleotides (75% PID) and ending with 1 mutation every 250 nucleotides (99.6% PID). Mutated variant reads were also generated for each of the mutation parameters. In the third and fourth steps, the initial and mutated variants were combined and used as input for de novo assembly for the three experiments, as detailed in Supplement Figure S1.

One key observation is that the assembly result can change from two (correct) contigs to many (unresolvable) contigs simply by having variant reads. The presence of viral variants affected the contig assembly output of all 10 assemblers tested. The output of the SPAdes, MetaSPAdes, ABySS, Cap3, and IDBA assemblers shared a few commonalities, demonstrated by a conceptual model in Fig. 3a. First, below a certain PID, when viral variants have enough distinct nucleotides to resolve the two variant contigs, the de novo assemblers produced two contigs correctly (Fig. 3). We refer to this as “variant distinction” (VD), with the highest pairwise identity where this occurs as the VD threshold. Above this threshold, the assemblers produced tens to thousands of contigs (Fig. 3), a phenomenon we define as “variant interference” (VI). As the PID between the variants continues to increase, the de novo assemblers can no longer distinguish between the variants and assemble all the reads into a single contig, a phenomenon we define as “variant singularity” (VS) (Fig. 3). The lowest pairwise identity where a single contig is assembled is the VS threshold.
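The three regimes can be expressed as a simple rule on the number of contigs returned at each PID. The sketch below is an illustration of the definitions only, not the authors' analysis code; the function names and the contig counts in `example` are hypothetical values made up to exercise the logic.

```python
from typing import Dict, Optional, Tuple


def classify(contig_count: int) -> str:
    """Label an assembly outcome for a two-variant dataset.

    VS: a single contig; VD: exactly two contigs; VI: more than two contigs.
    A count of zero (e.g. a failed assembly) is labelled "none".
    """
    if contig_count == 1:
        return "VS"
    if contig_count == 2:
        return "VD"
    if contig_count > 2:
        return "VI"
    return "none"


def thresholds(counts_by_pid: Dict[float, int]) -> Tuple[Optional[float], Optional[float]]:
    """Return (VD threshold, VS threshold).

    The VD threshold is the highest PID classified as VD; the VS threshold is
    the lowest PID classified as VS. Either is None if the regime never occurs.
    """
    vd = [pid for pid, n in counts_by_pid.items() if classify(n) == "VD"]
    vs = [pid for pid, n in counts_by_pid.items() if classify(n) == "VS"]
    return (max(vd) if vd else None, min(vs) if vs else None)


# Hypothetical contig counts keyed by PID (illustrative only, not study data).
example = {98.0: 2, 98.9: 2, 99.0: 57, 99.1: 120, 99.3: 1, 99.6: 1}
print({pid: classify(n) for pid, n in example.items()})
print(thresholds(example))  # (98.9, 99.3)
```

An assembler that never returns exactly two contigs in the tested range, as reported for Cap3 and IDBA, would simply yield `None` for the VD threshold under this rule.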

Variant interference in 10 de novo assemblers. a Schematic diagram depicting the concepts of VD, VI, and VS, and their relationship to PID. b Comparison of output from 10 different assemblers. The number of contigs produced by each de novo assembler at different variant PID ranges (75–99.6%) is shown. c Close-up of PID ranges where variant interference is the most apparent. Blue denotes de Bruijn graph assemblers (DBG); green denotes overlap-layout-consensus assemblers (OLC); orange denotes commercialized proprietary algorithms. Variant distinction, VD; variant interference, VI; variant singularity, VS. *For SOAPdenovo2, several data points returned zero contigs due to a well-documented segmentation fault error. The y-axis denotes the number of contigs.

Slight differences in the variant interference patterns (relative to the canonical variant interference model) were observed for the 10 assemblers investigated. VD was observed for SPAdes, MetaSPAdes, and ABySS assemblers. While it was not observed with Cap3 and IDBA with the current simulated data parameters, we speculate that VD may occur at a lower PID level for these assemblers than tested in this study. The PID range where VI was observed was distinct for each de novo assembler (Fig. 3). During VI, SPAdes produced as many as 134 contigs and ABySS produced 3076 contigs, while MetaSPAdes, Cap3, and IDBA produced up to 10.

A different pattern was observed for the Mira, Trinity, and SOAPdenovo2 assemblers. The average number of contigs generated by Mira, Trinity, and SOAPdenovo2 was 5, 36, and 283, respectively, across all variant PIDs from 75 to 99.6%. Specifically, Mira and Trinity generated fewer contigs at low PID, but produced many contigs when the two variants reached 97.1% PID and 96.0% PID, respectively. For SOAPdenovo2, a larger number of contigs was produced regardless of the PID. This indicates that these assemblers generally have major challenges producing a single genome; this has been observed in previous studies comparing assembly performance [22].

Finally, Geneious and CLC were the least affected by VI in the simulated datasets tested, returning only 1–5 contigs for all pairwise identities. CLC’s assembly algorithm primarily returned a single contig over the range of PIDs tested (218/247 simulations; 88.3%), thus favoring VS. In comparison, Geneious predominantly distinguished the two variants (234/247 simulations; 94.7%), favoring VD.

### Effect of GC content and genome length on variant assembly

For Experiment 2, we focused our study on evaluating whether the VI observed in SPAdes de novo assembly is influenced by the GC content or genome length of the pathogen. SPAdes was chosen because it produced a well-defined variant interference pattern that closely resembled the conceptual model (Fig. 3a). It is also one of the leading assemblers for viral assembly (Fig. 1), possibly due to its ability to assemble viral variants without variant interference at most PIDs. Two datasets were used for the evaluation: reads generated from four artificial genomes ranging in length from 2 Kb to 1 Mb, as well as from genome sequences of poliovirus (NC_002058; 7440 nt in length) and coronavirus (NC_002645; 27,317 nt in length). No discernable correlation was observed between the GC content of variant genomes and the degree of VI for any of the simulated datasets (Supplemental Figure S1, p < 0.0001). Therefore, for subsequent analyses examining the effects of genome length on VI, the number of contigs at each PID level was obtained by averaging the 13 GC simulations.

Notably, no matter the genome length, SPAdes produced vastly more contigs (i.e., VI) in a constant, narrow range of PID (99–99.21%; Fig. 4a & b). The effect of variants on assembly was characterized by the three distinct intervals described previously: VD at lower PIDs, VI (Fig. 4b), and VS at higher PIDs for all genome lengths. For example, during VS, a single contig was generated when the two variants shared ≥99.22% PID, but tens to thousands of contigs were generated at a slightly lower PID of 99.21%. This PID threshold, 99.21%, marked the drastic transition from VI to VS, whereas the transition from VD to VI (i.e., the VD threshold) occurred at 98.99% PID (Fig. 4b). A correlation was observed between genome length and the number of contigs produced during VI: longer genomes returned proportionally more contigs, as expected, because the total number of VI occurrences should increase with length (r² = 0.967; p < 0.0001; Fig. 4b and c).
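One way to see why the contig count should scale with genome length is to count the sites at which the two variants differ. The sketch below is simple arithmetic under the assumption that differences are spread evenly along the genome; it does not model how SPAdes breaks contigs, and `expected_variant_sites` is a hypothetical helper name.

```python
def expected_variant_sites(genome_length: int, pid_percent: float) -> float:
    """Number of positions at which two variants differ at a given PID."""
    return genome_length * (1.0 - pid_percent / 100.0)


# The four artificial genome lengths plus the poliovirus and coronavirus
# reference lengths mentioned in the text, evaluated inside the VI range.
genome_lengths = (2_000, 7_440, 10_000, 27_317, 100_000, 1_000_000)
for length in genome_lengths:
    sites = expected_variant_sites(length, 99.21)
    print(f"{length:>9} nt: about {sites:.0f} differing sites at 99.21% PID")
```

Because the number of differing sites grows linearly with length at a fixed PID, each additional site is another place where the assembly can fragment, which is consistent with the proportional increase in contigs reported above.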

The effect of genome length and read length on de novo assembly of simulated variants across a range of percentage identities (PID). a & b Comparison of genome lengths. Six different genome lengths were assembled and the final contig counts were tallied across varying PID thresholds (75–99.6%). For the simulated genome lengths of 2 Kb, 10 Kb, 100 Kb, and 1 Mb, the average contig number at each PID was plotted. Panel (b) shows the close-up view where interference was the most prominent. For all six genome lengths and each of the 13 iterations, VI consistently occurred in the same range of PID (99.00–99.24%). The assembly makes a transition from VD to VI at the threshold of 99.00%, and it makes a transition from VI to VS at the threshold of 99.24%. Also, the longer the genome, the more contigs were produced during VI. c The relationship between genome length and the total number of contigs produced. Data from panel (a) were plotted on a logarithmic scale. The total number of contigs produced is significantly dependent on the genome size (r² = 0.967; p-value < 0.0001). d and e The effect of read length on variant assembly with a genome size of 100 Kb. Simulated data with four different read lengths were created and assembled, and the final contig counts were tallied across varying PID thresholds (75–99.6%). Panel (e) shows the close-up view where interference was the most apparent. When longer read lengths were used, the variant interference PID range was much narrower than when shorter read lengths were used to build contigs.

### Effect of read length on variant assembly

The read length of a given NGS dataset will vary depending on the sequencing platform and kits utilized to generate the data. Since read length is an important factor for de novo assembly success [23], we hypothesized that it may also influence the ability to distinguish viral variants. For Experiment 3, using SPAdes, we investigated assemblies with four typical read lengths: 50, 100, 150, and 250 nt. At longer read lengths, the VD threshold occurred at higher PIDs (Fig. 4d & e). Also, with increasing read length, the width of the PID window where VI occurs gradually decreased from a 1.52% spread to a 0.21% spread (Fig. 4e). This indicates that longer reads are better for distinguishing viral variants with high PIDs.
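A rough intuition for the read-length effect is that a read can only be attributed to one variant if it covers at least one differing site. The sketch below assumes evenly spaced differences and reads that start at uniformly random positions, and it ignores sequencing error and the k-mer sizes an assembler actually uses; it is a simplification for illustration, not the analysis performed in the study, and the function name is hypothetical.

```python
def fraction_reads_spanning_variant(read_length: int, spacing: int) -> float:
    """Fraction of reads that cover at least one variant site.

    With one differing site every `spacing` nt, a read of `read_length` nt
    starting at a random offset covers a site with probability
    read_length / spacing, capped at 1.
    """
    return min(1.0, read_length / spacing)


# 99% PID corresponds to one difference every 100 nt;
# 99.6% PID corresponds to one difference every 250 nt.
for spacing in (100, 250):
    for read_length in (50, 100, 150, 250):
        frac = fraction_reads_spanning_variant(read_length, spacing)
        print(f"spacing {spacing:>3} nt, read {read_length:>3} nt: {frac:.2f}")
```

Under these assumptions, only half of 50-nt reads carry variant information at 99% PID, whereas every 250-nt read does, which is in line with longer reads resolving variants at higher PIDs.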

### In silico experiments examining variant assembly with NGS data derived from clinical samples

For clinical samples, assembly of viral genomes is affected by multiple factors other than the presence of variants, including sequencing error rate, host background reads, depth of genome coverage, and the distribution (i.e., pattern) of genome coverage. We next utilized viral NGS data generated from four picornavirus-positive clinical samples (one coxsackievirus B5, one enterovirus A71, and two parechovirus A3) to explore VI in datasets representative of data that may be encountered during routine NGS. The NGS data for each sample were partitioned into four bins of read data: (1) total reads after quality control (T); (2) major variants only (M); (3) major and minor variants only (Mm); and (4) major variants and background non-viral reads only (MB) (Fig. 5). These binned datasets were then assembled separately using three assembly programs: SPAdes, Cap3, and Geneious. These programs were chosen as representatives of different assembly algorithms: SPAdes is a leading de Bruijn graph (DBG) assembler, Cap3 is a leading overlap-layout-consensus (OLC) assembler, and Geneious is proprietary software. By comparing these manipulations, we aimed to test the hypothesis that minor variants directly affect the performance of assembly through VI in real clinical NGS data.

The effect of variant interference in a real dataset from a clinical sample containing enterovirus A71 (EV-A71) and its variants. Fastq reads were partitioned into four components: trimmed reads after quality control (T), major variant (M), minor variant (m), and background (B). These reads were then combined into four different experiments (T, M, Mm, and MB) and assembled using SPAdes. The contig representation schematic, showing the abundance and length of the generated contigs, reveals the impact of variant interference on de novo assembly. The bar graphs show the UG50% metric and the length of the longest contig. UG50% is a percentage-based metric that estimates the length of the unique, non-overlapping contigs as a proportion of the length of the reference genome [24]. Unlike N50, UG50% is suitable for comparisons across different platforms or samples/viruses. More clinical samples and viruses are analyzed similarly in Fig. 6.

Even with an adequate depth of coverage for genome reconstruction, assembly of total reads (T) in 11/12 experiments resulted in unresolved genome construction – resulting in numerous fragmented viral contigs (Fig. 6). The only exception was one experiment where one single PeV-A3 (S1) genome was assembled using Cap3. When only reads from the major variant were assembled (M), full genomes were obtained for all datasets using SPAdes and Cap3, and for the CV-B5 sample using Geneious. Conversely, assembly of the read bins containing major and minor variants (Mm) resulted in an increased number of contigs for 9 of the 12 sample and assembly software combinations tested (Fig. 6), indicating that VI due to the addition of the minor variant reads likely adversely affected the assembly. The presence of background reads with major variant reads (MB) did not appear to affect viral genome assembly, as the UG50% value, a performance metric which only considers unique, non-overlapping contigs for target viruses [24], was similar between M and MB datasets.

The effect of variant interference on the assembly of four clinical datasets using three assembly programs. Fastq reads were partitioned into four categories: total reads (T), major variant (M), minor variant (m), and background (B). These reads were then combined into four different categories: T, M, major and minor variants (Mm), and major variant and background (MB). Datasets were assembled using SPAdes, Cap3, and Geneious. The bar graphs show the UG50% metric and the length of the longest contig. Coxsackievirus B5, CV-B5; Enterovirus A71, EV-A71; Parechovirus A3 (Sample 1), PeV-A3 (S1); Parechovirus A3 (Sample 2), PeV-A3 (S2).

## Electronic supplementary material

### 12918_2011_782_MOESM1_ESM.PDF

Additional file 1: Work flow of the reconstruction and reduction processes. The work flow of the reconstruction was performed similarly to the protocol recommended for the generation of high-quality reconstruction (Thiele & Palsson, 2010). (PDF 302 KB)

### 12918_2011_782_MOESM5_ESM.XLSX

Additional file 5: Table of substrate usage by M. extorquens AM1 from experimentally observed phenotype and Flux Balance Analysis using the genome scale network (iRP911). (XLSX 15 KB)

### Electron flow through the metabolic network of M. extorquens AM1

Additional file 6: Electron flow through the metabolic network of M. extorquens AM1. The schemas represent the reactions involved in electron flow in M. extorquens AM1 as they appear in the network reconstruction (iRP911). Details on the reactions, given with identifiers of the type R-XXXX, can be found in Additional file 2. (PDF 25 KB)

### 12918_2011_782_MOESM7_ESM.XLSX

Additional file 7: Methylotrophic network. The reactions of the GS network were included or excluded from the methylotrophic network based on multi-criteria analysis. The list of considered criteria is given here as well as the score for each reaction of the GS network. The list of reactions included in the methylotrophic network are indicated as '1' in the 'methylotrophic network' column. (XLSX 152 KB)

### 12918_2011_782_MOESM8_ESM.XLSX

Additional file 8: EFM analysis of the primary assimilation pathways and connectivity of biomass precursors. The EFMs were calculated for the conversion of methanol into 13 key carbon precursors. The main properties of the calculated EFMs are given. (XLSX 11 KB)

### Reaction essentiality in the methylotrophic network

Additional file 9: Reaction essentiality in the methylotrophic network. The graph displays the essentiality or dispensability of reactions and shows the experimental evidence for gene essentiality. Reaction essentiality was analyzed using Minimal Cut Set calculation [42] applied to the set of EFMs [30] allowing biomass production from methanol. Fragility Coefficients (FCs) were calculated from the MCSs [42]. Reactions having an FC of 1 were identified as essential (red arrows), and reactions with an FC < 1 were considered dispensable (blue arrows). The enzyme(s) catalyzing the network reactions are represented by boxes. The experimental phenotypes of mutants affected for these enzymes are displayed using a color code: red box, lethal phenotype; blue box, non-lethal phenotype; black box, no experimental data available. The accuracy of the model prediction is indicated above the bar for each class of reaction. (PDF 167 KB)

### 12918_2011_782_MOESM13_ESM.XLSX

Additional file 13: Flux distributions and sensitivity analysis. Only C1 assimilation was considered during flux calculation due to the large difference in range between C1 dissimilation and assimilation. Measured methanol uptake rate was considered subsequently. (XLSX 23 KB)

### 12918_2011_782_MOESM14_ESM.PDF

Additional file 14: Quality of isotopomer fitting. Comparison of experimental and collected isotopomer values for the three biological replicates. The isotopomer data include both LC-MS and 2D-NMR (HSQC and TOCSY) data. Flux calculation and fitting were performed using the software 13CFLUX (Wiechert et al, 2001). A) Experimental values (+/- standard deviation) are plotted against theoretical values. B) Residuum of the calculated data. (PDF 385 KB)

### 12918_2011_782_MOESM15_ESM.PDF

Additional file 15: Flux variability in the 3 biological replicates. Comparison of the flux distribution obtained for the three biological replicates. The flux calculation and the sensitivity analysis were performed using the software 13CFLUX (Wiechert et al, 2001). The fluxes were normalized by the flux of entry of the C1-units into central metabolism (SHMT: serine hydroxymethyltransferase). Flux distributions were found to be similar except for slight changes through the C3/C4 interconversions (pyruvate kinase (PK), pyruvate phosphate dikinase (PPDK), malic enzyme (ME), and phosphoenolpyruvate carboxykinase (PEPCK)), and through the Entner-Doudoroff pathway. (PDF 207 KB)

### 12918_2011_782_MOESM16_ESM.XML

Additional file 16: Computer model written in Systems Biology Markup Language of the genome-scale metabolic network iRP911. (XML 735 KB)