Information

What tool to use to map reads (samtools, sequencher,… )?


I am a beginner in Bioinformatics. I have 4 files:

2 fastq files (aln1.fastq and aln2.fastq)

2 bam files (aln.bam and aln.bam.bai)

I know that:

  • the raw file is aligned to hg19 human genome.
  • The sequenced file is pair-ended from Miseq sequencing platform.
  • This file is the result of an amplicon sequencing design.

I also have sequence information about the primers (4 forward and 4 reverse) and 2 adaptors.

And I have to answer these two questions:

  1. What is the constitution of each read, for example (adaptor + primer + amplified region) ?
  2. Which gene region did those reads mapped to ?

My question is what kind of tool do I have to use to respond to these questions? samtools? sequencher? FastQC? Can anybody help me just to start because I am totally lost.

Thanks


Thealn.bamfile is likely a binary and compressed version of a file in SAM format, indicating where reads have aligned. I guess it has been generated by some alignment program using the two fastq files.

For a start, you might use a program to visualize where the reads have mapped.

For instance, you may use IGV and load youraln.bamfile to see where in hg19 they aligned. IGV will also use thealn.bam.baifile to help it find the aligned reads in thealn.bamfile.

As far as I know, sequencher is used to clean and assemble Sanger sequencing data, but my experience with that program is almost 10 years old.

Samtools is a command-line program that can help you extract information from the aln.bam file. You may use it to count how many alignments are present in a given genomic region, for instance, using option-cofsamtools view. See the manual for more details: http://www.htslib.org/doc/samtools.html


Samtools is not an aligner; it is used to analyse alignments. FastQC is for analysing the read qualities and average composition. You can get to know the adaptor sequence by running FastQC.

The next step is adaptor trimming. There are many tools that do that. Trimmomatic is currently popular. You can find the adaptor composition of each read by subtracting the read length post trimming from the read length pre-trimming (this would be same for all reads).

Then you have to align your trimmed reads to a reference, using an aligner. Again, there are many aligners available; Bowtie, STAR and now a new one called HISAT. STAR and HISAT are faster than Bowtie.

Since you already have the alignments (aln.bam) to hg19, you do not need to perform alignment. You want to know, to which genes do these reads map to. This information is not available unless you have the genome annotations. The annotations are available in the form of GTF/GFF files. You can obtain the association between genomic locations and the reads using theBEDOPStoolkit as mentioned in this Biostars post.

I don't have much experience with this tool. What I would do is convert theBAMtoSAM(using samtools) or toBED(using bedtools:bamtobed) and parse the columns of theSAM/BEDfile describing the read locations with theGTFfile. See this link for the details on the SAM format. You can use any scripting language likeawkorperlto parse.

Samtools does have an option to filter the reads according to regions specified in theBEDformat but it will not automatically annotate them.

samtools view -hL regions.bed aln.bam > aln_filtered.sam

SNP Analysis

Screening the millions of reads that next-generation sequencing produces presents a major challenge when searching for candidate SNPs. Both the Maq and GSNAP algorithms include SNP screening capabilities.

Maq(1) has a two level screening process which searches initially for differences between the reads and the reference sequence. Next a filtering step takes place which sifts the initial results looking for a minimum number of variants of the same class per column of reads, and variants embedded within high quality regions. The results are presented in a comprehensive report. You can also view your results in a Maqview or Tablet.

With GSNAP(2) the SNP analysis takes a different approach looking at both previously reported SNPs as well as new candidates. The user must supply a list of known SNPs as well as the reads and a reference sequence. GSNAP performs a SNP-tolerant alignment of all major and minor alleles. The algorithm enables the minor alleles to be differentiated from mismatches. The results are presented in a comprehensive report. You can also view your results in Tablet.

(1) Heng Li, Jue Ruan and Richard Durbin
Mapping short DNA sequencing reads and calling variants using mapping quality scores
Genome Research 2008 18:1851-1858

(2) Thomas D. Wu and Serban Nacu
Fast and SNP-tolerant detection of complex variants and splicing in short reads
Bioinformatics 2010 26: 873-881


Samtools & Integrated Genomics Viewer

Samtools will convert .sam files that were generated in the previous steps to .bam files. This site gives step by step instructions for converting from sam to bam, sorting and indexing a .bam.

One method of visualizing the reads in a .bam file is the IGV from Broad. I’m not going to go into details, but the idea is:

  1. Choose the same reference genome that you used to index your reference genome from the previous steps.
  2. Upload your .bam file (index file “.bai” must be in same folder as .bam that you upload)
  3. Type in the gene of interest in the search bar.
  4. Zoom in to see the reads, they aren’t visible at low power.

Every computer is set up a little differently invoking different usage and calls to the software. Comments and suggestions welcome!


A Working with AppleScript tutorial and some sample AppleScript scripts are installed with Sequencher, and are also available on this website. The tutorial and scripts give you a glimpse into the potential of AppleScript and how it can help you get more from Sequencher. Click here to download the AppleScript samples.

As of the winter of 2006, Gene Codes distributes Green, Blue, and Yellow SuperPro keys. We will no longer ship the Purple Eve3 and the Red SuperPro key.

Key Function Supported Software
Green or black with green stripe Network SuperPro Key Sequencher License Server 6.1 and above
Blue or black with blue stripe Standalone SuperPro Key Mac Sequencher 4.7 and above
Yellow or black with yellow stripe Standalone SuperPro Key Windows Sequencher 4.0 and above
Red NW label Network SuperPro Key Windows Sequencher License Server 5.2 and above
Purple NW label Network Eve3 Key Mac Sequencher License Server 5.2 to 6.0.3
Purple SA label Standalone Eve3 Key Mac Sequencher 3.1 to 4.7


Results and discussion

Using bio-samtools: a brief tutorial

bio-samtools in use is straightforward, here are a few examples of interacting with BAM files with the package. More information on specific functions is provided in the RubyDoc documentation and in the files bioruby-samtools/doc/tutorial.html and bioruby-samtools/doc/tutorial.pdf. The location of the bio-samtools installation folder can be found by typing ’gem which bio-samtools’ at the command-line.

Installation

bio-samtools is easily installed from a machine with an internet connection and a Ruby installation with the straightforward Gem invocation ’gem install bio-samtools’. bio-samtools automatically downloads the original libbam C source code and compiles it for Linux or OSX as appropriate. The new version of the library is kept locally to the bio-samtools code to avoid conflicts with other installations of the library.

Loading a BAM file

A SAM object represents the alignments in the BAM file, and is very straightforward to create, you will need a sorted BAM file, to access the alignments and a reference sequence in FASTA format to use the reference sequence. The object can be created and opened as follows: require 'bio-samtools' bam=Bio::DB::Sam.new(:bam=>"my_sorted.bam", :fasta=>'ref.fasta') bam.open bam.close

Opening the file needs only to be done once for multiple operations on it, access to the alignments is random so you don’t need to loop over all the entries in the file, as you would with a manual SAM file parse.

Getting summary information

The length of reference sequences and the number of reads mapped to each can be obtained with the index_stats function. A Hash object, keyed by reference name and with a Hash at each value is returned. The Hash at the value has keys :length, :mapped_reads and :unmapped_reads and values for each of these. The index_stats function wraps the SAMtools idxstats command. sam.index_stats # returns < "chr_1"=> <:length=>69930, :mapped_reads=>1000, :unmapped_reads=>0 >, >

Retrieving reference sequence

Retrieving the reference can only be done if the reference has been loaded, which isn’t done automatically in order to save memory. Reference need only be loaded once, and is accessed using reference name, start, end in 1-based co-ordinates. A standard Ruby String object is returned. In this example a 500 nucleotide region from the start of the sequence is returned. bam.load_reference seq = bam.fetch_reference("Chr1", 1, 500)

Retrieving alignments in a region

Alignments in a region of interest can be obtained one at a time by giving the region to the fetch() function. bam.fetch("Chr1", 3000, 4000).each do | alignment | puts alignment.qname #do something with the alignment object end

Get a summary of coverage in a region

It is easy to get the total depth of reads at a given position, the chromosome_coverage function is used. This differs from the previous functions in that a start position and length (rather than end position) are passed to the function. An array of coverages is returned, eg [26,26,27 .. ]. The first position in the array gives the depth of coverage at the given start position in the genome, the last position in the array gives the depth of coverage at the given start position plus the length given. coverages = bam.chromosome_coverage("Chr1", 3000, 1000) Similarly, average (arithmetic mean) of coverage can be retrieved, also with start and length parameters av_cov = bam.average_coverage("Chr1", 3000, 1000)

Getting pileup information

Pileup format represents the coverage of reads over a single base in the reference. Getting a Pileup over a region is very easy. Note that this is done with mpileup and NOT the now deprecated and removed from SAMTools pileup function. Calling the mpileup method creates an iterator that yields a Pileup object for each base. bam.mpileup do |pileup| puts pileup.consensus end

The mpileup function takes a range of parameters to allow SAMTools level filtering of reads and alignments. They are specified as key, value pairs. In this example a region is specified by :r and a minimum per base quality score is specified by :Q. bam.mpileup(:r => "Chr1:1000-2000", :Q => 50) do |pileup| puts pileup.coverage end

Not all the options SAMTools allows you to pass to mpileup are supported, those that cause mpileup to return Binary Variant Call Format (BCF) [13] are ignored. Specifically these are g,u,e,h,I,L,o,p. Table 4 lists the SAMTools flags supported and the symbols you can use to call them in the mpileup command.


How to extract unmatched reads using bwa and samtools?

I have a single read (NOT paired) that I need to pass through the workflow described in Beauclair et al. paper (free version here https://rnajournal.cshlp.org/content/24/10/1285.long) for identifying defective genomes using their DI-tector program.

Here in Materials and methods the procedure is described as the following:

The first step of the workflow consists of an alignment of the reads against the host genome (Fig. 2i). This step aims at discarding reads that map to the host genome and may partially map to the viral genome after segmentation, and at reducing the working file size. For example, MV and rMV-ΔV data sets were generated from total RNA samples of infected cells and contained mostly reads mapping the human genome (≈99%). This step uses a combination of bwa mem and samtools view with the parameters –bS –f4 . An additional step consists of an alignment of the reads against the viral genome of interest, in order to exclude reads perfectly mapping the viral genome. Therefore, only unmapped reads are further analyzed. Of note, clipped reads (i.e., CIGAR motif contain S or H) are also conserved. Some of these reads may map to viral genome recombination junctions that are present in DI genomes.

I have already been suggested to use bowtie2 instead of bwa but, firstly, the output is not clear for me and secondly, I would like to test the official protocol.

Since the article suggests to use bwa and samtools on this very first step, that is what I have done so far:

(Optional) not sure if this is important but as suggested by someone I have transformed .fna --> .fa

Indexed human genome with bwa

aligned my only read with indexed genome from step 1

Separated unmapped reads (as it is recommended in Materials and Methods using -f4 )

Converted unmapped reads into .fastq format (since this is the format used by the software later)


Step 6: Export BAM file as a table

From the Graphical Sequence Viewer, zoom to the desired location and select a range of interest.

Right-click the selected range and click the Export Data option in the context menu.

An alignment export menu will be opened. Note that BAM files are stored as alignments, so you need to select “Alignment Table File” in the list on the left. Select the desired location in the main section. Name the target file. If you need to change the default export location use the folder button. Click the Next button.

In the next screen, select fields from the alignment file to export.

Click the Finish button. Your file will be exported.

The exported file can be opened in a spreadsheet program like Excel for further use.


Acknowledgements

We thank our colleagues Prof Bill Rawlinson and A/Prof Rowena Bull, who oversaw collection and processing of SARS-CoV-2 patient isolates for a separate study. Thanks also to Chandima Samarasinghe, Harshana Weligampola, Nirodha Suchinthana, Rahal Medawatte and Yasiru Ranasinghe for providing valuable feedback after testing Genopo. We acknowledge the following funding sources: MRFF grant APP1173594 (to I.W.D.) and Cancer Institute NSW Early Career Fellowship 2018/ECF013 (to I.W.D.) and philanthropic support from The Kinghorn Foundation (to I.W.D. and H.G.). The contents of the published materials are solely the responsibility of the participating or individual authors, and they do not reflect the views of any funding body listed.


Background

DNA methylation is an important epigenetic modification involved in gene silencing, tissue differentiation, and cancer [1]. High-resolution, genome-wide measurement of DNA methylation is now possible using whole-genome bisulfite sequencing (WGBS), a process whereby input DNA is treated with sodium bisulfite and sequenced. While WGBS is comprehensive, it is also quite costly [2]. For instance, an application of WGBS by Lister et al. [3] compared DNA methylation profiles of an embryonic stem cell line and a fibroblast cell line. Both were sequenced to about 30× coverage (25× coverage of all CpGs), requiring 376 total lanes of bisulfite sequencing on the Illumina GA II instrument. While conventional wisdom is that 30× coverage or deeper is needed to achieve accurate results, advanced statistical techniques proposed here, such as local likelihood smoothing, can reduce this requirement to as little as 4×.

It has also been shown that different genomic regions exhibit different levels of DNA methylation variation among individuals [4]. As a consequence, regions that are inherently variable can easily be confused with regions that differ consistently between groups when few replicates are available [1] (Figure 1). But performing WGBS on the number of biological replicates required to overcome such issues can be quite expensive. The techniques proposed here address this issue both by making full use of replicate information during analysis, and by potentially reducing the coverage needed for (and therefore the cost of) replication.

The need for biological replicates. We show smoothed methylation profiles for three normal samples (blue) and matched cancers (red) from the Hansen data [1]. Also shown is the smoothed methylation profile for an IMR90 cell-line (black) from the Lister data [3]. Had we only analyzed normal-cancer pair 3 (thick lines), there would appear to be a methylation difference between cancer and normal in this genomic region. When all three cancer-normal pairs are considered, however, this region does not appear to be a cancer-specific differentially methylated region.

Analysis of WGBS data starts with alignment of bisulfite converted reads. After alignment, statistical methods are employed to identify differentially methylated regions (DMRs) between two or more conditions. Extensive work has been dedicated to alignment [5–10] but methods for post-alignment analysis are limited. Published work based on WGBS has relied on a modular approach that first identifies differentially methylated CpGs that are then grouped into regions using ad hoc grouping rules. The first step is carried out using either Fisher's exact test [3, 11–13], arbitrary cutoffs for differences in observed methylation levels [14], or a beta-binomial model [15]. None of these methods take biological variability into account. To the best of our knowledge, no software is available implementing these approaches.

Here we present BSmooth, a comprehensive analysis tool for WGBS datasets. The BSmooth pipeline begins with an unbiased and bisulfite-aware read alignment step, compiles quality assessment metrics based on stratifying methylation estimates by read position, applies local averaging to improve precision of regional methylation measurements, and detects DMRs accounting for biological variability when replicates are available. The main methodological contribution of BSmooth is the ability to identify DMRs accounting for biological variability, as well as the quality control measures we propose. In addition, BSmooth includes a new aligner, Merman, which appropriately handles colorspace. We demonstrate the benefits of BSmooth with four publicly available datasets: the Lister data [3], the Hansen data [1], the Hansen-capture data [1] and the Tung data [16] (see Materials and methods for details). We use these data to demonstrate the advantages of BSmooth over existing algorithms based on Fisher's exact test. BSmooth is the first pipeline for WGBS datasets yielding DMRs as output, while also taking biological variation into account. It can handle low-coverage experimental designs, allowing researchers to profile several samples at the same cost as a high-coverage profile of a single sample.


What tool to use to map reads (samtools, sequencher,… )? - Biology

Bismark is a program to map bisulfite treated sequencing reads to a genome of interest and perform methylation calls in a single step. The output can be easily imported into a genome viewer, such as SeqMonk, and enables a researcher to analyse the methylation levels of their samples straight away. It's main features are:

  • Bisulfite mapping and methylation calling in one single step
  • Supports single-end and paired-end read alignments
  • Supports ungapped and gapped alignments
  • Alignment seed length, number of mismatches etc. are adjustable
  • Output discriminates between cytosine methylation in CpG, CHG and CHH context

Bismark is now also available from GitHub. You are invited to leave comments, feature request or bug reports over there!

This link will take you to the Bismark publication.

This link will take you to our review about primary data analysis in BS-Seq.

This link will take you to our protocol 'Quality Control, trimming and alignment of Bisulfite-Seq data' at the Epigenesys website.

Here you can access the Bismark documentation Bismark User Guide.

Here is an overview of the alignment modes that are currently supported by Bismark: Bismark alignment modes (pdf).


Watch the video: ngs sam samtools (December 2021).