How can I view modENCODE data faster?

I am trying to view several data tracks in the modENCODE GBrowse genomic browser. However, the site is so slow, it is practically unworkable. Is there a faster way to explore the data?


  • You can always download the data (as @WYSIWYG has suggested in the comment).

  • Transfer the data to Galaxy

  • Convert and export the selection of your choice.


Blinded by Big Science: The lesson I learned from ENCODE is that projects like ENCODE are not a good idea

When the draft sequence of the human genome was finished in 2001, the accomplishment was heralded as marking the dawn of the age of “big biology”. The high-throughput techniques and automation developed to sequence DNA on a massive scale would be wielded to generate not just genomes, but reference data sets in all areas of biomedicine.

The NHGRI moved quickly to expand the universe of sequenced genomes, and to catalog variation within the human population with HapMap, HapMap 2 and 1000 genomes. But they also began to dip their toe into the murkier waters of “functional genomics”, launching ENCODE, a grand effort to build an encyclopedia of functional elements in the human genome. The idea was to simultaneously annotate the human genome and provide basic and applied scientists working on human disease with reference data sets that they would otherwise have had to generate themselves. Instead of having to invest in expensive equipment and learn complex protocols, they would often be able to just download the results, thereby making everything they did faster and better.

Now, a decade and several hundred million dollars later, the winding down of ENCODE and the publication of dozens of papers describing its results offer us a vital opportunity to take stock of what we learned, whether it was worth it, and, most importantly, whether this kind of project makes sense moving forward. This is more than just an idle intellectual question. NHGRI is investing $130m in continuing the project, and NHGRI and the NIH as a whole have signalled their intention to do more projects like ENCODE in the future.

I feel I have a useful perspective on these issues. I served as a member of the National Advisory Committee for the ENCODE and related modENCODE projects throughout their lifespans. As a postdoc with Pat Brown and David Botstein in the late 90’s I was involved in the development of DNA microarrays and had seen first hand the transformative potential of genome sequences and the experimental genomic techniques they enabled. I believed then, and still believe now, that looking at biology on a big scale is often very helpful, and that it can make sense to let people who are good at doing big projects, and who can take advantage of economies of scale, generate data for the community.

But the lesson I learned from ENCODE is that projects like ENCODE are not a good idea.

American biology research achieved greatness because we encouraged individual scientists to pursue the questions that intrigued them and the NIH, NSF and other agencies gave them the resources to do so. And ENCODE and projects like it are, ostensibly at least, meant to continue this tradition, empowering individual scientists by producing datasets of “higher quality and greater comprehensiveness than would otherwise emerge from the combined output of individual research projects”.

But I think it is now clear that big biology is not a boon for individual discovery-driven science. Ironically, and tragically, it is emerging as the greatest threat to its continued existence.

The most obvious conflict between little science and big science is money. In an era when grant funding is getting scarcer, it’s impossible not to view the $200m spent on ENCODE in terms of the 125 R01’s it could have funded. It is impossible to score the value lost from these hundred or so unfunded small projects against the benefits of one big one. But an awful lot of amazing science comes out of R01’s, and it’s hard not to believe that at least one of these projects would have been transformative.

But, as bad as the loss of individual research grants is, I am far more concerned about the model of independent research upon which big science projects are based.

For a project like ENCODE to make sense, one has to assume that, when a problem in my lab requires high-throughput data, someone – or really a committee of someones – who has no idea about my work predicted, years in advance, precisely the data that I would need and generated it for me. This made sense with genome sequences, which everyone already knew they needed to have. But for functional genomics this is nothing short of lunacy.

There are literally trillions of cells in the human body. Multiply that by life stage, genotype, environment and disease state, and the number of possible conditions to look at is effectively infinite. Is there any rational way to predict which ones are going to be essential for the community as a whole, let alone individual researchers? I can’t see how the answer is possibly yes. What’s more, many of the data generated by ENCODE were obsolete by the time they were collected. For example, if one were starting to map transcription factor binding sites today, one would almost certainly use some flavor of exonuclease ChIP, rather than the ChIP-seq techniques that dominate the ENCODE data.

I offer up an example from my own lab. We study Drosophila development. Several years ago a postdoc in my lab got interested in sex chromosome dosage compensation in the early fly embryo, and planned to use genome-wide mRNA abundance measurements in male and female embryos to study it. It just so happened that the modENCODE project was generating genome-wide mRNA abundance measurements in Drosophila embryos. Seems like a perfect match. But these data were all but useless to us, not because the data weren’t good – the experiment was beautifully executed – but because their data could not answer the question we were pursuing. We needed sex-specific expression; they pooled males and females. We needed extremely precise time resolution (to within a few minutes); they looked at two-hour windows. There was no way they could have anticipated this – or any of the hundreds of other questions about developmental gene expression that came up in other labs.

We were fortunate. I have money from HHMI and was able to generate the data we needed. But a lot of people would not have been in my position, and in many ways would have been worse off because the existence of ENCODE/modENCODE makes it more difficult to get related genomics projects funded. At this point the evidence for such an effect is anecdotal – I have heard from many people that reviewers explicitly cited an ENCODE project as a reason not to fund their genomics proposal – but it’s naive to think that these big science projects will not affect the way that grants are allocated.

Think about it this way. If you’re an NIH agency looking to justify your massive investment in big science projects, you are inevitably going to look more favorably on proposals that use data that have already been, or are about to be, generated by expensive projects that feature in the institute’s portfolio. And the result will be a concentration of research effort on datasets of high technical quality, but little intrinsic value, with scientists wanting to pursue their own questions left out in the cold, and the most interesting and important questions at risk of never being answered, or even asked.

You can already see this mentality at play in discussions of the value of ENCODE. As I and many others have discussed, the media campaign around the recent ENCODE publications was, at best, unseemly. The empty and often misleading press releases and quotes from scientists were clearly masking the fact that, despite publishing 30 papers, they actually had very little of grand import to say, today, about what they found. The most pensive of them realized this, and went out of their way to emphasize that other people were already using the data, and that the true test was how much the data would be used over the coming years.

But this is the wrong measure. These data will be used. It is inevitable. And I’m sure this usage will be cited often to justify other big science projects ad infinitum. And we will soon have a generation of scientists for whom an experiment is figuring out what kinds of things they can do with data selected three years earlier by a committee sitting in a windowless Rockville hotel room. I don’t think this is the model of science anyone wants – but it is precisely where we are headed if the metastasis of big science is not amended.

I want to be clear that I am not criticizing the people who have carried out these projects. The staff at the NIH who ran ENCODE and the scientists who carried it out worked tirelessly to achieve its goals, and the organizational and technical feat they achieved is impressive. But that does not mean it is ultimately good for science.

When I have raised these concerns privately with my colleagues, the most common retort I get is that, in today’s political climate, Congress is more willing to fund big, ambitious sounding projects like ENCODE than they are to simply fund the NIH extramural budget. I can see how this might be true. Maybe the NIH leadership is simply feeding Congress what they want in order to preserve the NIH budget. And maybe this is why there’s been so little push back from the general research community against the expansion of big biology.

But it will be a disaster if, in the name of protecting the NIH budget and our labs’ funding, we pursue big projects that destroy investigator driven science as we know it in the process.


Total RNA was isolated using Trizol, then Qiagen RNeasy spin columns. mRNA was isolated using Dynal oligo(dT) beads, then fragmented using divalent cations under elevated temperature, followed by sequential ligation of RNA linkers to the 5’ and 3’ ends. Next, reverse transcription was performed using a primer complementary to the 3’ linker and PCR was performed using primers complementary to both linkers.

300 bp fragments were isolated from an agarose gel and gel-purified again.

The samples were quantitated using a Nanodrop, and loaded onto a flow cell for cluster generation and sequenced on an Illumina Genome Analyzer II using either single read or paired end protocols (Illumina).

Reads were aligned to Dmel_Release_6 using the STAR aligner v2.3.0e (Linux x86_64) with default parameters on the FASTQ files to generate multiply-mapped BAM files. These were filtered to include only reads with a single aligned hit (NH:i:1 attribute) to generate uniquely-mapped BAM files. A custom script was used to convert BAM files into bedgraph files.
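The uniqueness filter described above can be illustrated on SAM-formatted text. This is only a sketch under stated assumptions, not the script used to produce the FlyBase files: the function names and the example records are hypothetical, and real BAM files would first need to be decoded to SAM text by a dedicated tool.

```python
def is_uniquely_mapped(sam_line: str) -> bool:
    """Return True if an alignment record carries the optional field NH:i:1."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    # Comparing whole fields avoids matching values such as NH:i:10.
    return "NH:i:1" in fields[11:]


def filter_unique(sam_lines):
    """Yield header lines unchanged and alignment lines only when NH:i:1 is present."""
    for line in sam_lines:
        if line.startswith("@") or is_uniquely_mapped(line):
            yield line


# Hypothetical records: one header line, one unique hit, one multiply-mapped hit.
example = [
    "@HD\tVN:1.4\tSO:coordinate\n",
    "read1\t0\tchr2L\t100\t255\t36M\t*\t0\t0\t" + "A" * 36 + "\t" + "I" * 36 + "\tNH:i:1\n",
    "read2\t0\tchr2L\t500\t3\t36M\t*\t0\t0\t" + "A" * 36 + "\t" + "I" * 36 + "\tNH:i:2\n",
]
kept = list(filter_unique(example))
```

Here `kept` retains the header and `read1` only, because `read2` reports two aligned hits.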

modENCODE Tissues Profile

The RNA-seq profiles displayed by FlyBase in GBrowse and used for RPKM calculation can be accessed at the FTP link below as .wig files. Please take note of how these FlyBase .wig files represent data for a contiguous sequence of bases with the same signal value. The value is declared only for the first position of that region, and applies to all positions that follow (these are not explicitly listed) until a new value at a new base position is declared.
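The run-style representation described above (a value declared once and carried forward until the next declaration) can be expanded as in the following sketch. It assumes the file has already been parsed into sorted `(position, value)` pairs; the parsing step, the function names, and the example numbers are hypothetical and are not taken from the FlyBase files.

```python
from bisect import bisect_right


def expand_runs(declared, end):
    """Expand sorted (position, value) declarations into one value per base.

    Each value applies from its declared position up to the base before the
    next declaration; the last value applies through `end` (inclusive).
    """
    per_base = {}
    for i, (start, value) in enumerate(declared):
        stop = declared[i + 1][0] if i + 1 < len(declared) else end + 1
        for pos in range(start, stop):
            per_base[pos] = value
    return per_base


def value_at(declared, pos):
    """Look up the signal at one base without expanding the whole region."""
    starts = [p for p, _ in declared]
    i = bisect_right(starts, pos) - 1
    # Positions before the first declaration have no declared value.
    return None if i < 0 else declared[i][1]


# Hypothetical declarations: signal 5 from base 100, 0 from base 104, 12 from base 110.
declared = [(100, 5), (104, 0), (110, 12)]
per_base = expand_runs(declared, end=112)
```

The lookup function is usually preferable for large chromosomes, since expanding every base multiplies memory use by the average run length.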


To determine copy number genome-wide, we performed next generation DNA sequencing (DNA-Seq) on naked DNA harvested from 19 modENCODE cell lines [32–41] and control DNA from adult females (Table 1). We then mapped the sequence reads to release 5 of the D. melanogaster reference genome to identify the relative copy number of each gene. In two cases, we resequenced libraries made from independent cultures, grown in different labs (S2-DRSC and Cl.8) to assay copy number stability, and found excellent agreement. For the Cl.8 line, we found that the overall genome copy number structure was 99.6% identical. For the highly rearranged S2-DRSC line, we observed 87.2% copy number agreement between two independent cultures, suggesting that even these highly aberrant copy number states are relatively stable. Below, we describe the structure of these genomes in order of degree of copy number change.

Ploidy of cell lines

We first determined basal genome ploidy status from ratiometric DNA-Seq data. We took advantage of the extensive copy number deviations in the cell lines to make this determination. In our DNA-Seq analysis of the cell lines, we set the mean peak of DNA-Seq read count density at ‘1’ to reflect the relative nature of the measurements and plotted X-chromosome and autosomal DNA-Seq densities separately (Figure 1). DNA density ratios from different copy number segments can be represented as fractions with a common denominator and the smallest such denominator indicates the minimum ploidy. One good illustration was the S1 cell line. We observed a DNA-density peak at 1.47 from DNA-Seq of S1 cells, suggesting that a segmental duplication of autosomal DNA occurred in this line (approximately 50% increase) on a baseline diploid karyotype, since there was no DNA block with intermediate DNA content between approximately 1.5 and 1. Another example is Kc167 cells, which had at least four levels of relative read-count ratios centered on 0.58, 0.77, 1.03 and 1.29. This distribution of DNA densities was consistent with tetraploidy. In the majority of cases, this simple analysis yielded a clear ploidy estimate. We scored BG3-c2, Cl.8, D20-c2, D20-c5, D4-c1, L1, S1, W2, and D8 cell lines as minimally diploid, and S2-DRSC, S2R+, S3, Sg4, Kc167, D16-c3, and D17-c3 cell lines as minimally tetraploid. Our results for D9 and mbn2 cell line ploidy were inconclusive, due to the presence of multiple regions of relative read densities that were not ratios of whole numbers.
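The smallest-common-denominator reasoning above can be sketched as a search for the lowest ploidy at which every peak ratio corresponds to a near-integer copy number. This is an illustrative heuristic, not the procedure used in the study: the function name, the choice of the peak nearest 1 as baseline, the mean-deviation criterion, and the tolerance of 0.1 are all assumptions.

```python
def minimal_ploidy(peak_ratios, max_ploidy=8, tolerance=0.1):
    """Return the smallest ploidy at which all peaks sit near whole copy numbers.

    The peak closest to 1 is treated as the baseline and assigned `ploidy`
    copies; every other peak is rescaled to copies and compared with the
    nearest integer. Returns None if no ploidy up to `max_ploidy` fits.
    """
    baseline = min(peak_ratios, key=lambda r: abs(r - 1.0))
    for ploidy in range(1, max_ploidy + 1):
        copies = [r * ploidy / baseline for r in peak_ratios]
        mean_dev = sum(abs(c - round(c)) for c in copies) / len(copies)
        if mean_dev < tolerance:
            return ploidy
    return None


# Peak centers quoted in the text for S1 (baseline 1 plus a peak at 1.47) and Kc167.
s1_ploidy = minimal_ploidy([1.0, 1.47])
kc167_ploidy = minimal_ploidy([0.58, 0.77, 1.03, 1.29])
```

With these inputs the sketch returns 2 for S1 and 4 for Kc167, matching the minimal diploid and minimal tetraploid calls in the text. A return value of `None` corresponds to the inconclusive case in which peak ratios are not ratios of small whole numbers.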

Cell line ploidy by DNA-Seq. Histograms of normalized DNA read density of 1 kb windows. Red, reads from X chromosomes; black, reads from autosomes; blue, centers of individual peak clusters; gray, peak cluster ratios. #1 and #2 indicate the results from two independent sets of DNA-Seq from different labs.

Ratiometric DNA-Seq data allowed us to determine minimal ploidy, but not absolute ploidy. Therefore, we also examined mitotic spreads (Figure 2; Additional files 1 and 2) to make ploidy determinations. In contrast with relative DNA-Seq measurements, mitotic chromosomes can be counted directly to determine chromosome number, although it is not always possible to determine exact chromosome identity due to rearrangements. We observed that S1, Kc167, S2-DRSC, S2R+, S3 and D20-c5 were tetraploids. BG3-c2 and 1182-4H cells were diploid. The DNA-Seq read ratio patterns for D20-c5 suggested minimal diploidy, not tetraploidy, which may be due to a whole genome duplication following establishment of a relative copy number profile as detected by DNA-Seq.

Karyotypes. (A,B) Metaphase spread figures of S2R+ cells (A) and as aligned in karyograms (B). Chromosomes 2 and 3 that are wild-type, or close to wild-type, are designated with ‘2’ and ‘3’. If rearrangements were found on them, such as deletions, inversions or translocations, they are marked with ‘r’ (2r and 3r). Small chromosomes that carried euchromatic material appended to a centromeric region that was likely to derive from a large autosome are labeled as ‘am’. Chromosomes whose origin could not be determined are labeled ‘nd’. (C) Chromosome numbers in metaphases from 145 S2R+ cells. (D) A heatmap summarizing chromosome numbers. Metaphase spreads for all the cell lines are provided in Additional file 1.

Interestingly, the karyotypes of individual cells varied in all lines (Figure 2; Additional file 1). Prima facie, the variable numbers of chromosomes in the cells are in disagreement with the consistency of the DNA-Seq calls. For example, DNA-Seq results indicated tetraploidy for D17-c3 cells, yet the karyogram showed a mixed state with diploid and tetraploid cells. Despite these heterogeneous ploidies, the DNA-Seq values for independent cultures (separated by an unknown, but presumed large, number of passages) showed good agreement. These data suggest that even if the cell-to-cell karyotypes differ, the distribution of karyotypes is stable in the population of cells from a given line.

Chromosomal gains and losses in cell lines

We identified frequent numeric aberrations of the X, Y, and fourth chromosomes. X chromosome karyotype is a natural copy number deviation that determines sex in Drosophila. Sexual identity is fixed early in development by Sex-lethal (Sxl) autoregulation [42], so deviations in the X chromosome to autosome (X:A) ratio that may have occurred during culture are not expected to result in a change in sex. Therefore, we used DNA-Seq-derived copy number and then expression of sex determination genes in expression profiling experiments (RNA-Seq) to deduce if the X chromosome copy was due to the sex of the animal from which the line was derived, or if the copy number change was secondary during culture.

In control females (Figure 1), there was a single peak of DNA read density centered on approximately 1 regardless of whether the reads mapped to the X chromosome or to autosomes. In the cell lines there were clear cases of X:A = 1 (that is, female), X:A = 0.5 (that is, male), and some intermediate values. DNA-Seq results for the S2-DRSC, BG3-c2, Cl.8, D20-c2, D20-c5, D4-c1, L1, mbn2, S1, S3, Sg4 and W2 lines showed under-representation of reads mapping to the X chromosome (X:A <0.75), suggesting that they are male, or female cells that have lost X chromosome sequence. Similarly, by these criteria Kc167, D8, D9, D16-c3 and D17-c3 cells appear to be female (X:A >0.75), but might also be male with extensive X chromosome duplications. Cytological analysis confirmed these findings (Additional file 1).
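The X:A cutoff used above can be written as a small classifier. This is a sketch only: the function names and label strings are hypothetical, and the input densities are assumed to be the centers of the X-chromosome and autosomal read-density peaks.

```python
def xa_ratio(x_density: float, autosome_density: float) -> float:
    """Ratio of X-chromosome to autosomal normalized read density."""
    return x_density / autosome_density


def classify_xa(ratio: float, cutoff: float = 0.75) -> str:
    """Apply the 0.75 cutoff from the text; the labels are illustrative."""
    if ratio < cutoff:
        return "male-like (or female with X loss)"
    if ratio > cutoff:
        return "female-like (or male with X duplication)"
    return "undetermined"


# The two unambiguous cases named in the text: X:A = 0.5 and X:A = 1.
male_call = classify_xa(xa_ratio(0.5, 1.0))
female_call = classify_xa(xa_ratio(1.0, 1.0))
```

As the text notes, the ratio alone cannot separate sex of origin from gains or losses in culture, which is why each label carries its alternative explanation.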

To determine sexual identity we analyzed the expression of sex-determination genes and isoforms from RNA-Seq data compared to those from 100 different lines of sexed D. melanogaster adults (Table 2). In Drosophila, the MSL complex (MSL-1, MSL-2, MSL-3, MLE proteins, and RoX1 and RoX2 non-coding RNAs) localizes to the X chromosome and hyper-activates gene expression to balance transcription levels to that of autosomes [43]. The alternative splicing of Sxl pre-mRNAs controls SXL protein production, which in turn regulates MSL formation by modulating msl-2 splicing and protein levels. Sxl also regulates sex differentiation via the splicing of transformer (tra) pre-mRNA [44, 45]. Except for D9 cells, we observed that the genes for the two RNA components of the male-specific MSL complex (roX1 and roX2) were expressed at female levels in the cell lines with X:A >0.75 (Kc167, 1182-4H, D8, D16-c3, and D17-c3), suggesting that observed DNA-Seq copy number values were due to the female identity of the cells used to establish these cultures. Similarly, cell lines that had an X:A <0.75 (D4-c1, BG3-c2, Cl.8, D20-c5, L1, mbn2, S2-DRSC, S2R+, S3, Sg4, W2 and S1) expressed roX1 and/or roX2 at male levels, which was again consistent with the deduced sex. The expression of msl-2, tra, and Sxl was also consistent with sex karyotype. Overall, the cell lines with an X:A >0.75 showed female expression, while those with a ratio of <0.75 showed male expression (P < 0.01, t-test); however, there was some ambiguity. For example, D9 expressed intermediate levels of roX1, male levels of msl-2 and female tra. We suggest that in the majority of cases X chromosome karyotype is the result of the sex of the source animals, but where karyotype and sex differentiation status are ambiguous, the X chromosome copy number may be due to gains/losses during culture.

Interestingly, both functionally redundant roX genes were expressed in whole adult males (not shown), while in the cell lines, sometimes only one roX gene was highly expressed. To determine if expression of a single roX gene was sufficient for MSL-complex-mediated dosage compensation, we measured X chromosome gene expression relative to autosomes. Overall transcript levels from genes from the X chromosomes in the cells that expressed roX genes at male levels were not significantly different from those of autosomes (P > 0.25 for all cell lines, t-test), suggesting that having a single roX is sufficient for normal X chromosome dosage compensation in these cell lines.

We observed frequent loss of the Y chromosome from the male cell lines. The D. melanogaster Y chromosome is not currently assembled, but some Y-chromosome genes are known. DNA-Seq reads were mapped on the Y chromosome (chrYHet) in a minority of the male cell lines (BG3-c2, Cl.8, S1, and W2) and we observed Y chromosomes by cytology in BG3-c2, Cl.8 and S1 lines (Additional file 1). The failure to map reads to Y chromosomes in the other male lines (D20-c5, L1, mbn2, S2-DRSC, S2R+, S3, Sg4) was also consistent with karyograms and reflects loss of Y chromosomes (Additional file 1). The Y chromosome bears only a few fertility genes (X/0 flies are sterile males) that should be of little consequence outside the germline. Frequent loss suggests that there is little selective pressure to maintain a Y in tissue culture cells.

Lastly, we observed widespread loss/gain of the short (approximately 1.4 Mb) fourth chromosome in cell lines by both DNA-Seq and cytology (Figure 3A; Additional file 1). The number of fourth chromosomes was variable within cell lines as well. As an illustration, in Cl.8 cells, where the overall genome structure is a relatively intact diploid, the number of fourth chromosomes varied from 0 to 3. This observation was also supported by DNA-Seq results, which demonstrated a clear decrease of copy number (combined P < 1.0e-11, false discovery rate (FDR)-corrected permutation test).

DNA copy numbers. (A) Plots of mapped DNA read density along the genome. Deduced copy number is indicated by color (see key). (B) Heatmaps display how many cell lines have increased (green) or decreased (red) copy number. Black lines in the first two rows show significance. Blue lines indicate breakpoints. Black in the bottom row shows the number of breakpoints shared by the 19 cell lines. (C) A zoomed-in map of the sub-telomeric region (1 Mb) of chromosome 3L. Asterisks: genes within the highly duplicated regions. Genes with little or no functional information (‘CG’ names) were omitted for brevity.

Segmental and focal copy number changes

We observed frequent sub-chromosomal copy number changes (Figure 3A; Additional file 3). Some of the larger departures from ploidy were also identifiable in the karyograms. For example, mitotic spreads of S1 cells exhibited an acrocentric chromosome that looked like the left arm of chromosome 2 (‘2r’ in Additional file 1), which was reflected in DNA-Seq data as an extended high copy number block. However, most of the focal changes were submicroscopic, in the low megabase range. Collectively, we observed more increases of copy number (1,702) than decreases (388). On average, 12.9% of the haploid genome was duplicated, or gained, while 6.3% was deleted, or lost; 95% of the copy number blocks were shorter than 0.8 Mb (median = 37 kb) in the case of increased copy and 1.8 Mb (median = 97 kb) in the case of decreased copy.

DNA-Seq data showed that genome structure was cell line-specific. For example, in Cl.8 cells we observed few copy number changes, which were spread over multiple small segments covering only 0.88% of the genome. In contrast, in S2-DRSC and Kc167 cells, we observed copy number changes for >30% of the genome. Interestingly, Kc167 cells had more low copy number regions than high copy number regions, while S2-DRSC had more high copy number regions than low copy number regions. These data indicate that there are fundamentally different routes to a highly rearranged genomic state.

While the overall genome structures were cell line-specific, we did observe regions of recurrent copy number change. While some of the cell lines (for example, S2R+ and S2-DRSC) are derived from a single ancestral cell line and differ by divergence, the majority of the cell lines were isolated independently, suggesting that similarities in genome structure occurred by convergent evolution under constant selection for growth in culture. Our investigation revealed 89 regions of the genome covering a total of approximately 9.3 Mb showing strong enrichment for increased copy number (Figure 3B; P < 0.05, FDR-corrected permutation test). Among those segments, 51 regions were longer than 5 kb. We also found 19 regions covering approximately 2.9 Mb with significant enrichment for decreases in copy number; 14 of these regions were longer than 5 kb. Driver genes promoting growth in culture may be located in these regions.

We examined regions of recurrent copy number change more closely to identify some candidate drivers. As an illustration, duplications of sub-telomeric regions of chromosome 3L (approximately 3 Mb) were found in 10/19 cell lines (combined P < 1.0e-16, FDR-corrected permutation test). The most overlapping segment within this region was a duplication region of approximately 30 kb. There are six annotated genes in this core duplicated segment (Figure 3C, asterisks): CR43334 (pri-RNA for bantam), UDP-galactose 4′-epimerase (Gale), CG3402, Mediator complex subunit 30 and UV-revertible gene 1 (Rev1). When we asked if any of these specific genes showed increased copy number in the other cell lines, even if segmental structure was lacking, we found that CR43334 and Rev1 had higher copy numbers in five additional cell lines. As another example, an approximately 19 kb duplication region in chromosome 2L was found in 10 different cell lines (combined P < 1.0e-17). This region included only one gene, PDGF- and VEGF-receptor related (Pvr), suggesting that copy number for this gene is highly selected for in cell culture. If genes in these recurrent copy number increase regions were drivers, then we would expect that they would be expressed in the cells. Indeed, pri-bantam and Pvr genes were highly expressed in the cell lines (Additional file 4).

Mechanisms generating segmental and focal copy number changes

Creation of common copy number changes would be facilitated by repeated breakage at ‘hot spots’ in the genome due to regions of microhomology or longer stretches due to structures such as inserted transposons. In the absence of selection, the extant breakpoint distribution would map the positions of such hot spots. We mapped breakpoints by examining read-count fluctuations in every 1 kb window over the genome to identify 2,411 locations with breaks in at least one of the 19 cell lines (Figure 3B; Additional file 3). Among these breakpoints, we discovered 51 hotspots of copy number discontinuity in the same 1 kb window (P = 5.00e-06, permutation test). This suggests that there are regions in the genome that suffer frequent breaks in tissue-culture cells. Investigation of hot spots revealed 18 containing long terminal repeats (LTRs) or long interspersed elements (LINEs) in the reference assembly, and an additional 9 regions showed simple DNA repeats within the 1 kb (±1 kb) windows. These observations are consistent with reports of overrepresentation of sequence repeats at copy number breakpoints [13], and with the suggested roles of transposable elements in the formation of copy number variants [46, 47]. For the recurrent copy number change regions, we observed a broad regional enrichment for breakpoints (P = 4.07e-10, Fisher’s exact test), but not precise locations. These data suggest that there were both structural features in the genome that promoted generation of copy number changes and selection that determined which copy number changes were retained.
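The idea of a breakpoint hotspot, a 1 kb window in which several lines show a copy number discontinuity, can be sketched as follows. The sketch assumes copy number has already been called per window for each line, which simplifies the read-count fluctuation analysis described above. The line names, the toy calls, and the two-line threshold are hypothetical, and no permutation test is performed.

```python
from collections import Counter


def breakpoints(calls):
    """Indices of windows whose copy number call differs from the previous window."""
    return {i for i in range(1, len(calls)) if calls[i] != calls[i - 1]}


def shared_breakpoints(calls_by_line, min_lines=2):
    """Map window index -> number of lines with a break there, kept if >= min_lines."""
    counts = Counter()
    for calls in calls_by_line.values():
        counts.update(breakpoints(calls))
    return {window: n for window, n in counts.items() if n >= min_lines}


# Toy per-window copy number calls for three hypothetical lines.
toy_calls = {
    "lineA": [2, 2, 3, 3, 3, 2, 2, 2],
    "lineB": [2, 2, 4, 4, 4, 4, 2, 2],
    "lineC": [2, 2, 2, 2, 2, 1, 1, 2],
}
hotspots = shared_breakpoints(toy_calls)
```

In the toy data, windows 2 and 5 each carry a break in two lines, while windows 6 and 7 are private to one line. Whether such sharing exceeds chance is the question the permutation test in the text addresses.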

Expression and DNA/chromatin binding profiles in relation to copy number

If copy number changes have a role in cellular fitness, the effect might be mediated by altered gene expression. We therefore examined the relationship between gene dose and expression in 8 cell lines that had more than 100 expressed genes in high or low copy number segments (Figure 4). In seven cell lines (S2-DRSC, S2R+, mbn2, Kc167, D8, D9 and D17-c3) mRNA level was positively correlated with gene dose. There was no correlation between gene expression and gene dose in Sg4 cells. Even in the cases where the correlation was positive, the correlation was usually not linear, as has been previously observed [31]. In most lines, we observed decreased expression per copy of high copy number genes (P < 0.05, Mann-Whitney U test). Similarly, overall gene expression of the low copy number genes was moderately higher than expected on a per copy basis (Figure 4). This sublinear relationship is evidence for a transcriptional dampening effect.
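The comparison against a one-to-one dose-expression relationship can be made concrete with a short calculation. The sketch assumes expression is compared on a log2 scale relative to genes at normal copy number; the function names and the example numbers are hypothetical and do not come from the study.

```python
from math import log2


def expected_log2_shift(copy_number: float, normal_copy: float) -> float:
    """Expression change expected if expression scaled one-to-one with gene dose."""
    return log2(copy_number / normal_copy)


def dampening(observed_log2_shift: float, copy_number: float, normal_copy: float) -> float:
    """Observed minus expected shift; negative values mean less expression per copy."""
    return observed_log2_shift - expected_log2_shift(copy_number, normal_copy)


# Hypothetical gene at 4 copies on a 2-copy baseline, observed at +0.6 log2 units.
expected = expected_log2_shift(4, 2)
residual = dampening(0.6, 4, 2)
```

A one-to-one response would put this gene at +1.0, so the residual of about -0.4 is the kind of sublinear response described above. For a low copy number gene, a positive residual would indicate higher expression per copy than expected.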

Copy number and expression. RNA-Seq analysis of S2-DRSC, S2R+, Sg4, mbn2, Kc167, D8, D9 and D17-c3 cells. Boxplots show interquartile ranges of the distribution of FPKM (fragments per kilobase per million reads) values of expressed genes (FPKM >1) for different copy number classes in the indicated lines. The number of genes in each class is shown. All FPKM values are centered to have the median of normal copy number gene expression as 0. Top, middle, and bottom lines of boxes correspond to upper quartile (Q3), median, and lower quartile (Q1) in the distribution, respectively. Notches show the 95% confidence interval of each median. Whiskers indicate the maximum, or minimum, value that is still within 1.5 times the interquartile distance (Q3 - Q1) from Q3 or Q1, respectively. Horizontal dashed lines indicate the expected FPKM values based on a one-to-one relationship between gene dose and expression. Asterisks display P-values, determined by Mann-Whitney U test (*P < 0.05, **P < 0.01, ***P < 0.001).

The transcriptional response to gene copy number could be gene-specific or dose-specific. A dose-specific compensation system might be expected to result in a global change to chromatin structure corresponding to copy number segments. There is precedent for such dose-specific modifications of X and fourth chromosomes. For example, the modENCODE chromatin structure analysis of S2-DRSC cells clearly shows differences between X and autosomal chromatin using any of a host of histone modification or binding of chromatin-associated proteins (Figure 5). This is consistent with the global regulation of the X in these male cells by the MSL complex and perhaps other regulators [27, 28].

Copy numbers and chromatin immunoprecipitation. (A,B) A heatmap that summarizes correlation between copy numbers and chromatin immunoprecipitation (ChIP) signals of expressed genes in S2-DRSC (A) or Kc167 (B) cell lines. Target proteins for ChIP and modENCODE submission numbers are listed (right side). Columns show autosomal promoter regions (1 kb upstream of transcription start) and gene body regions as indicated. (C,D) ChIP signals of H3K9me2 (C) and SU(HW) (D) at autosome gene bodies are displayed against different copy number classes as boxplots (S2-DRSC cells). Top, middle, and bottom lines of boxes indicate upper quartile, median, and lower quartile points, respectively. Notches indicate the 95% confidence interval of each median and whiskers display the maximum, or minimum, value within the range of 1.5 times the interquartile distance, respectively. Dots display individual genes within different copy number classes. Pearson’s correlation coefficient (r) and its significance (P-value) are shown. (E,F) ISWI ChIP signal analyzed for X chromosome gene bodies in a male (S2-DRSC; E) and a female (Kc167; F) cell line. TSS, transcription start site.

To determine if there was a chromatin signature for copy number, we asked if there were histone modification marks or occupancy sites that correlated with copy number classes in 232 modENCODE ChIP-chip datasets from S2-DRSC, Kc167, BG3-c2 and Cl.8 cells. We observed only a few weak correlations (|r| = 0.1 to 0.3), restricted to histone H3K9 di- and tri-methylation marks, and their related proteins (Figure 5), Suppressor of Hairy wing (SU(HW)), and Imitation SWI (ISWI). These correlations were slightly stronger for expressed genes. Interestingly, ISWI binding correlated with copy number on the X chromosome of male S2-DRSC cells, but not female Kc167 cell X chromosomes. ISWI binding did not correlate with copy number on the autosomes of either line. This localization on the X is consistent with the known role of ISWI protein in X chromosome structure, as ISWI mutant phenotypes include cytologically visible ‘loose’ X chromatin only in males [48, 49]. We found that histone H3K9me2 and me3 marks were negatively correlated with gene copy numbers in all four tested cell lines on all chromosomes. The histone H3K9 methyltransferase, Suppressor of variegation 3-9 (SU(VAR)3-9), showed the same pattern of binding, strongly supporting the idea that H3K9 methylation is a copy number-dependent mark. H3K9me2 and H3K9me3 epigenetic marks are associated with transcriptional repression [50]. SU(HW) functions in chromatin organization and is best known for preventing productive enhancer-promoter interaction. Thus, the relationship is the opposite of what one would expect if H3K9me2, H3K9me3, and SU(HW) were responsible for the reduced expression per copy we observed when copy number was increased. These results are more consistent with selection to drive down expression of these regions by both reduced copy number and transcriptionally unfavorable chromatin structure.

Pathway coherence

If there has been selection for particular advantageous copy number configurations in the cell lines, then this should result in a coherent pattern of events in terms of specific cellular activities such as growth control. As a first pass analytical tool, we performed Gene Ontology (GO) term enrichment analysis to determine if copy number changes were associated with particular functions (Figure 6; Additional file 4). Tissue culture cells have no obvious need for many of the functions associated with the complex interactions between tissues and organs in a whole organism and should not undergo terminal differentiation. Indeed, we found that genes with differentiation functions were randomly found in copy number change regions but were enriched in low copy number regions in Kc167 cells (P < 0.001, Holm-Bonferroni corrected hypergeometric test). Additionally, we found increased copy numbers of genes encoding members of the dREAM complex in S2-DRSC, mbn2, S1 and S2R+ cells. The dREAM complex represses differentiation-specific gene expression [51, 52], consistent with selection for copy number changes minimizing differentiation.
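The enrichment statistic named above can be sketched in a few lines. The following is a minimal illustration, assuming a one-sided hypergeometric test per GO term followed by Holm-Bonferroni adjustment across terms; the function names and example numbers are hypothetical, and the authors' actual software and settings are not stated in the text.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn without replacement from M genes,
    n of which carry the GO term of interest."""
    upper = min(n, N)
    total = comb(M, N)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

def holm_bonferroni(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        adj = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity of the adjusted values
        running_max = max(running_max, adj)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical example: 1000 genes, 50 annotated with a term,
# 100 genes in low copy number regions, 15 of them annotated.
p_term = hypergeom_sf(15, 1000, 50, 100)
adjusted = holm_bonferroni([p_term, 0.04, 0.03])
```

Holm's step-down procedure is never less powerful than a plain Bonferroni correction while still controlling the family-wise error rate, which is a common reason for choosing it when many GO terms are tested.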

Gene Ontology and copy number in S2-DRSC and Kc167 cells. (A) ‘Biological processes’ sub-ontology of overrepresented genes in S2-DRSC cells as a hierarchical structure. Circle size corresponds to relative enrichment of the term in GO categories. Circle colors represent P-values (Holm-Bonferroni corrected hypergeometric test). (B) GO enrichment of genes in low copy number segments of Kc167 cells. Please note that both S2-DRSC low and Kc167 high copy number genes are not significantly enriched in specific GO categories.

The most significant associations (P < 0.001) between copy number class and function were with genes having cell cycle, metabolic, or reproduction-related GO terms (reproduction-related categories contain many of the cell cycle genes due to the high rates of cell divisions in the germline relative to somatic cells in adult Drosophila). Interestingly, genes with cell cycle-related functions were enriched in both high copy number regions in S2-DRSC and low copy regions in Kc167 cells (P < 0.001 for both). The context of this dichotomy was informative. Genes with high copy numbers in S2-DRSC cells included Ras oncogene at 85D, string, Cyclin D, cdc2, and other positive regulators of cell cycle progression, or mitotic entry. These data suggest selection for growth occurred in S2-DRSC cells. In contrast, tumor suppressor genes, and negative regulators of cell cycle, including Retinoblastoma-family protein (Rbf), Breast cancer 2 early onset homolog (Brca2), and wee, were preferentially found in the low copy number regions of Kc167 cells, suggesting that inhibitors of cell growth were selected against in Kc167 cells. Thus, both the high copy number and low copy number events can be explained by selection for proliferation.

Compensatory copy number changes

Copy number changes in adult Drosophila result in propagation of transcriptional effects into the rest of the genome [53]. As these events can destabilize gene balance in pathways and complexes, we hypothesized that compensatory copy number changes might boost fitness. To examine this possibility, we asked if genes have undergone copy number changes to maintain protein-protein complex stoichiometry by overlaying copy number information of S2R+ cells onto a physical protein interaction network that was built from complexes isolated from the same cell line [54].

There were 142 protein-protein interaction networks that contained at least one gene product encoded from copy number change regions (Figure 7A). Among these, we identified 84 complexes that had >90% co-occurrence of copy number change in the same direction at the gene level (P = 0.041, permutation test). These copy number changes were not due to passenger effects as stoichiometry-preserving changes in copy number were still evident after filtering for nearby genes (P = 0.03). Examples included the genes encoding Vacuolar H+ ATPase (P = 0.017, hypergeometric test) and Dim γ-tubulin (DGT) complexes (P = 0.004), where members were among high copy number genes (Figure 7B,C). For both complexes, genes encoding their components were spread on five different chromosome arms with only a pair of genes showing <0.5 Mb proximity, indicating that the co-associations are not due to simple physical proximity in the genome. We also identified complexes where the encoding genes were in low copy, such as a Cytochrome P450-related complex (P = 0.001; Figure 7D). We found correlated copy number changes even for very large complexes, such as the small GTPase-related complex (cluster 6), which has 38 proteins. Twenty-four of the loci encoding cluster 6 members were present at high copy (Figure 7E; P = 5e-04). By examining complexes where we failed to score a simple correlation, we uncovered more complicated patterns where sub-components of the complex show correlated and anti-correlated copy number changes. A good illustration is the proteasome (Figure 7F). While the overall composition was consistent with genome-wide copy number levels, we found that genes encoding the lid of the regulatory 19S subunit showed coherent copy number reduction in S2R+ cells (P = 0.015, hypergeometric test). In contrast, proteins composing the base and alpha-type subunits of the 20S core were dominated by copy number gains (P = 0.017 and 0.014, respectively). This suggests that the actual occurrence of coherent copy number changes among genes encoding protein complex members may be higher than what we report here.
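The text does not spell out how the permutation test was constructed. One plausible reading, sketched below under stated assumptions, is to count complexes whose copy-number-changed members mostly move in the same direction and to compare that count with a null distribution obtained by shuffling the gain/loss labels across genes. All names, the `min_changed` cutoff, and the toy data are hypothetical.

```python
import random

def is_coherent(members, direction, threshold=0.9, min_changed=2):
    """True if more than `threshold` of the changed genes in a complex share a direction.
    direction maps gene -> +1 (gain), -1 (loss) or 0 (normal copy number)."""
    changed = [direction[g] for g in members if direction.get(g, 0) != 0]
    if len(changed) < min_changed:  # assumption: one changed gene cannot co-occur
        return False
    gains = sum(1 for d in changed if d > 0)
    majority = max(gains, len(changed) - gains)
    return majority / len(changed) > threshold

def count_coherent(complexes, direction):
    return sum(is_coherent(c, direction) for c in complexes)

def permutation_pvalue(complexes, direction, n_perm=1000, seed=0):
    """Observed coherent-complex count and a one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = count_coherent(complexes, direction)
    genes = list(direction)
    labels = [direction[g] for g in genes]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the link between genes and their copy number class
        shuffled = dict(zip(genes, labels))
        if count_coherent(complexes, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy network: two coherent complexes and one with a single changed gene
direction = {"a": 1, "b": 1, "c": 1, "d": -1, "e": -1, "f": -1, "g": 0, "h": 0}
complexes = [["a", "b", "c"], ["d", "e", "f"], ["a", "g", "h"]]
observed, p_value = permutation_pvalue(complexes, direction)
```

Shuffling labels across all genes preserves the genome-wide proportions of gains, losses and unchanged genes, so the null distribution reflects only the loss of association between complex membership and copy number class.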

Copy number and physical interaction networks. (A) A ternary plot that displays fractions of high, normal, and low copy number genes that encode complexes in Drosophila protein-protein interaction networks. Each point corresponds to a protein complex or a cluster. Distances from the three apexes in the triangle indicate fraction of cluster members from a given copy number class. Dashed lines indicate expected portion of each copy number class based on a random distribution of S2R+ cell line copy numbers. Complexes where copy number composition is significantly different from the expected ratio (P < 0.05, hypergeometric test) are filled in blue. (B-F) Protein interaction networks described and labeled in (A). Green, high copy gene products; red, low; white, normal. For (F), six proteins whose associations with the proteasome parts are not clear in the literature were omitted.


We suggest FlyBase be referenced in publications by citing this publication and the FlyBase URL ( We also recommend that when you are using FlyBase data (in your notebooks, spreadsheets, papers etc.) you make note of the FlyBase web site release (e.g. FB2011_08; the current release can be found in the header and footer on every page) and/or the sequenced species assembly.version release (e.g. D. melanogaster R5.40, found in the GBrowse header). In addition, we recommend that authors incorporate FlyBase object identifiers (e.g. FBgn and FBal) in addition to symbols for the unambiguous identification of intended FlyBase entities. Finally, we suggest that when preparing supplementary materials, you provide tabular data either in tab-separated files or in a spreadsheet rather than a PDF. Following these recommendations will greatly aid FlyBase curators in integrating your data into FlyBase.

Featured on the OSDC

A cloud-based system for genomic data, developed in collaboration with the Institute for Genomics and Systems Biology at the University of Chicago. Bionimbus is used by members of the NIH-funded modENCODE Consortium to analyze data produced by the project.

A collaboration with NASA to process Earth Observing 1 (EO-1) satellite imagery to detect fires and floods and provide relevant information to first responders. The data is freely available from the OSDC to interested users.

The open source software that drives the OSDC console, named after John Tukey, an American mathematician. Tukey has a number of features not found in other cloud console applications.

Learn more about the OSDC in this short video featuring OSDC Founders and Partners at the 2013 Supercomputing Conference in Denver, Colorado.


Over the last 15 years, many bioinformatics methods and tools have been developed for cis-regulatory sequence analysis (64). Broadly, they can be divided into two categories. The first category comprises methods for motif discovery on a set of co-regulated sequences, such as MEME-like approaches (dozens of methods and extensions exist). The second category comprises methods for CRM prediction through whole-genome scanning using one or more known motifs as input, often using Hidden Markov Models and sequence conservation cues [see (65) for a review]. A few methods, such as phylCRM/Lever, ModuleMiner and cisTargetX, combine both approaches and show increased motif discovery performance, even when very large upstream regions and introns are included in the analysis (28, 30, 31). The concept of these integrative methods is to apply genome-wide CRM scoring, including comparative genomics cues, for many different models (e.g. PWMs), followed by the identification of those particular models that yield the highest accuracy on a set of co-expressed genes. In this work we have introduced three important novelties into a new method, called i-cisTarget. The first is the a priori determination of 136K regions to be scored, which leads to an increased flexibility. In particular, this partitioning of the genome allows the analysis of both data sets of genomic loci (by selecting all 136K regions that overlap these loci) and co-expressed gene sets (by selecting all 136K regions that fall in the upstream and intronic space of all genes in the set). In this study we obtained good results for a genome segmentation using sequence conservation (phastCons) combined with insulator sites, and excluding coding exons. However, we envision that improvements can be made on the genome segmentation, for example by including coding exons (66) or using a segmentation that is guided by the high-throughput data sets (i.e. the iVEs) themselves. The latter can become practical as more and more data sets are generated with overlapping results, which may ultimately converge to a defined set of regulatory regions. The second novelty is the generalization of regulatory feature discovery, with the possibility to identify enriched motifs (as PWMs) but also enriched iVEs such as ChIP-peaks, and active/repressive chromatin marks. The third novelty is the ability to perform any combination of regulatory features, even across different types of features (e.g. a motif with a ChIP or DHS feature).

Taken together, these features allow the analysis of most kinds of high-throughput data available in Drosophila, and the combination of several analyses using the same tool for different datasets. For example, it is possible to combine the analysis of binding location data for a particular factor (ChIP) with the analysis of the corresponding expression data in mutant conditions for this factor, as we have shown for MEF2 (57) and Zelda (48, 56).

We have applied our tools to various datasets, distinguishing gene sets from sets of genomic loci. For gene sets, we have shown that i-cisTarget identifies the enrichment of the correct motif in most gene sets we investigated; failures to do so might be explained by the specificity of the binding motif to certain conditions or tissues. Enriched iVEs can lead to interesting new hypotheses, such as the co-operation between daughterless and Medea, inferred from the PNC set analysis, that resembles the recent discovery of Smad co-operation with master regulators (53) or the prediction of new TF-target and TF-TF interactions across cell types in Drosophila, as was demonstrated for Kenyon cells, pericardial cells and cardioblasts (Figure 5). Moreover, the discovered motifs lead to CRM predictions in the 5 kb + 5′-UTR + first intron of the input genes that have a high specificity to be regulatory regions, as was demonstrated on the zelda LOF dataset (56) and the PNC dataset (51). A current limitation of i-cisTarget, when analysing gene sets, is the arbitrary assignment of genomic regions to the gene set. Multiple demarcations are available at the i-cisTarget web tool, for example [5-kb upstream limited to upstream gene, 5′-UTR, and first intron] or [10-kb upstream limited, 5′-UTR, all introns, 3′-UTR and 10-kb downstream limited to downstream gene] (see ‘Materials and Methods’ section). A remaining challenge is identifying very distal enhancers and enhancers overlapping the coding sequence of nearby genes (66). A simple extension of the sequence search space, including more sequence and including intronic and exonic sequences from neighbouring genes, will not solve the problem. Indeed, when applying i-cisTarget to 100-kb upstream and downstream sequence of the TSS (this search space includes 100% of REDfly CRMs), without truncating this sequence at neighbouring genes, the performance drops dramatically (see Supplementary Figure S2).

We also used several ChIP datasets to investigate the performance of i-cisTarget on sets of genomic loci. Here, as for the gene sets, i-cisTarget performs very well in recovering the expected motif from a comprehensive library of motifs, but also highlights the involvements of other factors, such as Zelda or Trl in embryonic datasets. While motif discovery or enrichment is also performed by several other tools (45, 67), i-cisTarget adds the possibility to search for additional iVEs. We have shown that a TF-binding site (TFBS) does not necessarily correspond to a binding event. While potential binding sites for HSF or MEF2 cannot be distinguished from actual binding events based on motif enrichment alone, adding iVEs clearly selects marks typical for active chromatin as the best discriminant between actually bound or unbound sites. We emphasize that this result is obtained ab initio, without any prior knowledge of which are the relevant iVEs. Hence, additional signals are needed for a TF to bind to a motif sequence, and these are often related to marks of open or active chromatin: DNase hypersensitive sites, binding of pioneering factors such as Trl or Zelda, whose role as a general precursor of chromatin opening has only very recently been hypothesized (48). Interestingly, while in both HSF and MEF2 cases, the bound motifs present an enrichment for active features (GAF/Trl, CBP/p300, or DHS), the pattern of enriched features for unbound motifs is quite different. Namely, the unbound MEF2 motifs present an enrichment for repressive chromatin marks [Su(HW) or heterochromatin like features], while the unbound HSF motifs do not present any of these marks, consistent with what was reported in Guertin et al. (44). This might suggest a distinct mechanism of negative regulation through chromatin conformation between developmental processes and stress response pathways.

A feature of our approach that is not found in alternative studies is the ability to easily combine any number of features to investigate the synergistic effect of different features. Being based on ranks, using OS allows an ‘on-the-fly’ re-ranking of the 136K regions using particular combinations. We showed on the PNC and zelda gene sets that combinations of PWM and iVE yield higher 1%-AUCs, meaning a much higher specificity in the high ranking regions (Figure 4). This last result shows that transcriptional regulation is not a linear process, in the sense that the contribution of a combination of regulatory features is greater than the sum of the individual contributions, revealing a synergistic mechanism of action. Moreover, the fact that many different regulatory features are found enriched in the datasets we have studied previously confirms that transcriptional regulation is intrinsically a highly combinatorial process.
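The idea of re-ranking regions from a combination of per-feature rankings can be illustrated with a deliberately simplified stand-in. The sketch below combines rankings by the geometric mean of normalized ranks; this is not the order-statistics formula used by i-cisTarget, and the region names, scores and function names are hypothetical.

```python
def ranks_from_scores(scores):
    """Map region -> rank (1 = best) from region -> score, where higher is better."""
    ordered = sorted(scores, key=lambda region: scores[region], reverse=True)
    return {region: i + 1 for i, region in enumerate(ordered)}

def combine_ranks(rank_tables):
    """Re-rank regions by the geometric mean of their normalized ranks.
    A simplified stand-in for an order-statistics combination."""
    regions = set(rank_tables[0])
    for table in rank_tables[1:]:
        if set(table) != regions:
            raise ValueError("all rank tables must cover the same regions")
    n = len(regions)
    combined = {}
    for region in regions:
        product = 1.0
        for table in rank_tables:
            product *= table[region] / n  # normalized rank in (0, 1]
        combined[region] = product ** (1.0 / len(rank_tables))
    # Smaller combined value means the region ranks highly across features;
    # ties are broken by region name to keep the output deterministic.
    return sorted(regions, key=lambda region: (combined[region], region))

# Hypothetical scores for four regions under a motif (PWM) and a ChIP feature
motif_scores = {"r1": 9.0, "r2": 7.5, "r3": 2.0, "r4": 1.0}
chip_scores = {"r1": 0.2, "r2": 0.9, "r3": 0.8, "r4": 0.1}
reranked = combine_ranks([ranks_from_scores(motif_scores),
                          ranks_from_scores(chip_scores)])
print(reranked)
```

Because only ranks are stored per feature, any subset of features can be combined on request without rescoring the underlying sequences, which is the property the text describes as on-the-fly re-ranking.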

These two aspects (combinations and synergy) have already been extensively described before in the context of the enhanceosome model of regulation (68, 69). In particular, in Drosophila, analysis of a collection of curated CRMs showed that they are typically characterized by a combination of different TFBSs (70, 71). This heterotypic model has been shown to be the general rule, while homotypic CRMs are generally restricted to early embryogenesis (71).

However, these descriptions focused on combinatorial regulation by TFs alone. Here, we have confirmed recent evidence that this combinatorial regulation extends to other kinds of regulatory features such as histone modifications, binding of chromatin-modifying proteins or transcriptional co-factors such as CBP. Hence, we propose that the notion of heterotypic model of regulation should be extended to describe any combination of regulatory features, including motifs and chromatin-related features. Similarly to the CRM finding procedure consisting of finding clusters of TFBS for different TFs (26), we show that searching for ‘clusters’ of regulatory features can improve the predictive power of regulatory sequence analysis.

While our method currently applies to Drosophila, it can in principle be extended to any other organism for which large-scale collections of in vivo datasets are available, and in particular to human. The much greater size of non-coding regions in human, and the lower proportion of functional DNA in the human genome (72), would however require pre-selecting candidate regulatory regions, as using a full partition of the complete non-coding genome would become computationally intractable and would contain too much noise. We are currently working on implementing i-cisTarget for human, using the collection of ENCODE datasets.

Useful Guide on Writing an Excellent Biology IA Paper for IB

What’s the purpose of an IA paper, you might wonder? This crucial assignment aims to evaluate students’ application of their skills and knowledge in the subject. It takes theoretical knowledge a step forward and gives students a chance to put their learning into practical use.

How to Choose an Appropriate Topic for Biology IA

A lot depends on the topic you choose to write on for your Biology IA. Here’s how you can make sure you make the right choice that aptly demonstrates your comprehension of the subject.

Choose a Topic of Interest

“How does that matter? I’ll just put something together” - no, that’s not how it works. Believe it or not - when you are disinterested in a topic, it reflects in your work.

So, before beginning work on your Biology IA, it is advisable to choose a topic that interests you - something you would like to investigate and know more about.

While brainstorming, it’s a good idea to start with broad subjects and narrow them down to a focused topic or research question.

Browse Relevant Sources

Inspiration can be anywhere - this holds true not only for creative writing but also something technical like a Biology IA paper.

What relevant sources are we talking about here? Browse science or biology websites like Science Daily, encyclopedias and make a run to a library to see if you find any Biology books or journals that could give you topic ideas.

Once you identify topics that catch your eye, you can frame research questions that would be suitable for this assignment.

Needs to Involve Complex Research

How complex is complex? Well, if you choose a topic/research question that is obvious or simplistic, your investigation really has no value to offer.

In the same way, choosing a research question that has been studied in class is a bad idea too because you are not using your originality to bring something new to the table. The only way to demonstrate your individuality and original thinking is by choosing a novel topic.

Hence, choose a topic that is not too broad or narrow. A broad topic is not likely to be novel or focussed; similarly, a narrow topic does not give you much scope for investigation and analysis.

Consider Constraints

While some topics might sound great on paper, they might not be feasible given the time, equipment and resources you have in hand.

Be realistic about the research question and the investigation you plan to do - can it be done in the time you have? Is the experiment feasible? Do you have all the equipment required? ..and so on.

Another important aspect to note is that you should be able to present a strong hypothesis and define your independent and dependent variables.

Needs to be Safe and Ethical

The topic you choose needs to adhere to the safety, environmental and ethical issues. This is an important protocol of any experimental work.

For example, if your experiment involves people or animals - you need to address the safety precautions used. IB has strict policies against causing harm and distress to animals. Similarly, if any individuals are participating in your study, you are required to get their consent for the same.

Keeping all this in mind, make sure you choose a topic that respects this protocol.

5 Essential Tips to Write a High Scoring Biology IA

There are several factors that contribute to a high-scoring Biology IA paper. Here are 5 essential tips to help you ace this assignment.

Word the Question Well

The research question forms the first impression of your Biology IA paper - make sure you word it clearly and accurately. The question should be concise and focussed. For example, if your investigation involves a particular organism, make sure you mention the scientific name.

Cater to Each Marking Criterion

You will be evaluated on the following criteria

  • Personal Engagement (2 points)
  • Exploration (6 points)
  • Communication (4 points)
  • Analysis (6 points)
  • Evaluation (6 points)

While writing the paper, it’s important you cater to each of these as that is how it will be viewed by the teacher. The marking criteria are designed such that every aspect of the student’s skills and knowledge is evaluated.

Personal engagement focuses on assessing students’ originality and creativity - the IA paper needs to demonstrate the student’s unique thought process while carrying out the investigation.

Exploration refers to the methodology undertaken to complete the investigation and the reliability and sufficiency of the data used.

Communication takes into consideration the overall presentation of the paper which includes the flow of paper, vocabulary, grammar, use of scientific terms and presentation of data.

Analysis studies how the data was generated and the treatment it was given to come to the conclusion. Any conclusion drawn needs to be based on the evidence received from the data generated.

Evaluation looks at the relevance of the conclusion and whether it manages to answer the research question.


A well-structured and well-written IA paper is sure to get you high scores because it involves all the key aspects you are expected to cover. This is the ideal structure for a Biology IA paper.

  • Formulate a clearly defined research question based on the chosen topic for your IA
  • Make it as specific as possible by making it one brief sentence
  • Capture your dependent and independent variables

Background Information

  • Provide the theory behind your investigation and demonstrate your understanding
  • Set the research question into context
  • Use citations to support the biological theory

Hypothesis and variables

  • Provide experimental hypothesis and the null hypothesis
  • Provide dependent variables, independent variables and controlled variables with their respective units
  • Include the impact of each variable

Materials, methods and safety issues

  • List apparatus including their sizes and uncertainty
  • Provide a systematic procedure of conducting the investigation and recording the results
  • Include safety issues along with ethical issues
  • Provide raw data in tables along with calculations
  • Provide processed data using diagrams, lists, tables, graphs, etc.

Data analysis and interpretation

  • Analyze the findings
  • Discuss the impact of the uncertainty and include error bars
  • Discuss the pattern of data based on the research question

Conclusion, evaluation and bibliography

  • Make a summary of the findings and relate them to the RQ
  • Mention the limitations and strengths of the investigation
  • Provide future extension of the investigation
  • Provide relevant references in the bibliography section

Logical Presentation

The most appropriate way to present statistical data and findings is with the help of charts and graphs. Instead of writing it in words, graphs, charts and other visual forms of representation are more impactful in delivering the message.

Hence, wherever applicable, make sure you use well-illustrated graphs, charts or tables. They need to be labelled well such that they add value to the data you provide.

Start Early

Writing a Biology IA is time-consuming. There might be instances when you might find yourself drowning in a lot of work and find it difficult to do justice to this assignment.

At such times, we suggest reaching out to IB IA writing experts such as Writers Per Hour. Our team of professional writers know what it takes to deliver a well-articulated Biology IA paper that meets your deadlines and requirements.

So, instead of doing a rushed job, let our experts handle this crucial assignment for you.

10 tips to finishing your PhD faster

August 10, 2010, was a great day for Rodney Rohde – he finished his PhD. And he did it in four years while working as an Assistant Professor and then Associate Professor at Texas State University.

Now, as Professor, research dean and program chair of the Clinical Laboratory Science program in the College of Health Professions, he spends a great deal of time mentoring and coaching others in this sometimes mysterious and vague path.

Dr. Rohde's background is in public health and clinical microbiology. He has a bachelor's degree in microbiology, a master's degree in biology/virology and a PhD in education from Texas State. His dissertation was aligned with his clinical background: MRSA knowledge, learning and adaptation.

His research focuses on adult education and public health microbiology with respect to rabies virology, oral rabies wildlife vaccination, antibiotic resistant bacteria, and molecular diagnostics/biotechnology. He has published over 25 research articles and abstracts and presented at over 100 international, national and state conferences. He was awarded the 2012 Distinguished Author Award and the 2007 ASCLS Scientific Research Award for his work with MRSA. Recently, his work was the focus of an educational campaign regarding the important research focus of MRSA, which featured Dr. Rohde in a video by Texas State University that has been used by numerous media outlets. Learn more about his work here.

Recently, I came across a very interesting article here by Andy Greenspon, a PhD student in applied physics at Harvard: "9 things you should consider before embarking on a PhD." I thought Andy gave some fantastic advice, and it reminded me of a promise I made to myself while working on my PhD. In the wee hours of the night poring over coursework, informed consent documents, data analysis, and the umpteenth version of my dissertation, I vowed that if I ever finished my PhD, I would try to help others through the quicksand of a graduate school journey.

I hope I can begin to offer some help in the way of this list. Really, there's much more than I can put in a list of 10 items, so be on the lookout for more advice to follow.

1. Immerse yourself in writing – and learn how to write a funding proposal

Some might say this is more important after you finish a PhD. Don't fall into that trap. Learning how to write a funding proposal is nothing like writing your dissertation or a typical journal article. However, all types of funding proposals (federal, state, foundations, private/corporate, military) may offer you an opportunity to actually fund your research while working on your PhD. And it may very well be your best and most attractive resume item to landing a great job. For example, my professional organization, the American Society for Clinical Laboratory Science, offers research grants to conduct graduate research. I was able to fund most of my research budget by this opportunity. Many other federal granting agencies, organizations and private foundations will have funding opportunities that often offer graduate students a vehicle to fund their research, especially if you are conducting research that is important to that agency/foundation mission.

2. Find a strong mentor

I can't stress enough how important this is. Can it be your dissertation chair? Possibly, but find someone that can give you critical feedback on projects and encouragement. I was fortunate to have several colleagues in my college that had taken the PhD journey. I surrounded myself with several of these "PhD veterans," and they were able to help me avoid hurdles that could have slowed me down. They also were able to provide the most important thing a grad student might need – understanding and constant feedback. Think about finding someone that knows how to motivate you to finish jobs. It might be a colleague or a former professor. However, it should not be a friend that tells you all things will be just fine.

3. Grow a thick skin and take critical feedback for what it is – constructive criticism

It's OK to sulk a bit (we all do when we find out we are not a Nobel Prize winner in our first year of grad school), but get over it ASAP and learn from these comments. Most professors and advisors have much to share when it comes to the ins and outs of research design, writing for publication or finding grants. An old saying I always tell students and colleagues – "One often remembers the toughest teacher the most" – is true for a reason.

4. Find the right dissertation chair for you

I always tell new PhD students that the chair of the program may not be the right choice – or a brand new tenure track professor or the 30+ year professor in the department. Do your research! Do they "graduate" students in a timely manner, and are they decently well-known in their research field? Are they collegial?

One way to find a dissertation chair is to do some research via the internet, or you could talk to current graduate students about particular professors. The department might also be able to assist you in finding out the statistics on each professor. For example, I found out the start to finish time period for a graduate student and the PhD completion rate under "X" professor. In my personal opinion, you don't want a rookie professor that's trying to make tenure, and you don't want the retiring professor that may not be worried about research anymore. And it's OK if they are tough. If they teach you something and get you through the process, that's what matters. It's like parenting: they shouldn't be your friend when they need to be your parent!

5. Direct your course research projects or independent study for course credit towards your dissertation

This could easily be my number one piece of advice. If you can conduct literature reviews or pilot research projects in your preparatory courses towards what you want to do your dissertation on, do it. This step will help you save time downstream in the dissertation phase. I turned three independent studies (with future dissertation committee members) into nine hours of completed doctoral coursework while also completing much of my first two chapters for the dissertation. Let me explain how I did this in more detail.

I always knew that I wanted to conduct a dissertation on Methicillin Resistant Staphylococcus aureus (MRSA) with regard to the knowledge, learning, and adaptation of individuals who had been diagnosed with MRSA. So, I went to the department chair of my PhD program and asked about opportunities to take independent study courses (electives) that would allow me to build towards conducting my literature review, pilot study and funding opportunities for my topic. By the time I reached the proposal stage, I truly had my first two chapters of my dissertation in good shape.

6. Keep your dissertation topic as narrow as possible

You may want to save the world, but do you want to spend 10 years on your PhD? You have a research life after the PhD is done to save the world. Certainly, if you want to win the Nobel Prize while working on your dissertation, then go for it, but be prepared for a long commitment. This is very important.

A narrow topic might seem like you will not have enough data or things to say. However, the longer I do research, the more often I see the value in a strong but narrow research design. Seek out active researchers in your core area of interest and discuss the "needs" of that research. Is there something missing from the literature? Are there research questions or hypotheses already being asked that need answering? These are great ways to narrow your topic and be relevant for publication.

7. There's a reason 50 percent of PhD candidates stay ABD.

Perseverance and finishing the job, in my humble opinion, are the two most important traits and qualities one needs after coursework is complete. As I tell my own two children, it's OK to fail but it's not OK to quit. Set an agenda and schedule with your dissertation chair and be accountable to it, and keep your chair accountable. I met with my chair every three weeks during my dissertation and finished in one and a half years! It can be done. Don't let your chair or yourself off the hook on this item. Find the time to meet on a set schedule. I typically would promise my chair that I would have a portion of a chapter done before our meeting time.

And don't alienate your chair by emailing them pages to edit the night before. Always give them the courtesy of at least a week to review your work prior to your set time. They are very busy too, and it will be more productive if they have time to edit your pages in advance. Celebrate each hurdle that you clear so that you know you are making progress.

8. Focus only on the next step or hurdle as you work

This can be very difficult: not stressing out about the entire dissertation journey. It's so easy to become paralyzed by the mountain of checklists and things to do. This tip follows #7 for a reason. Set your agenda and schedule, and focus on what is immediately in front of you. Usually, the first step is forming your committee with a chair. Do that and celebrate. Then move to the next step, and the next:

  • Proposal/research design – check
  • IRB (institutional review board) consent – check
  • Pilot study – check
  • Gather data – check
  • Analysis – check
  • Write, write, write with a purpose and schedule – check
  • Defend – check
  • Finish – yes!

9. Find a strong quantitative (or qualitative) research colleague that will assist you with a strong design

This is a critical decision, and making it early and correctly will help your dissertation matter rather than end up on the shelf. It has been my experience that most poorly written or non-meaningful dissertations were the result of the wrong research design. If your university has a "go-to person" for quantitative design, seek that person out. But don't choose that person to be on your committee or to assist you if they are primarily a qualitative researcher.

If you are considering a mixed-methods approach, then you might consider that option. I have a very good friend who is an expert quantitative researcher and has won multiple funding awards on a variety of projects across multiple disciplines. He always states that this is the biggest weakness of dissertations: a poor design. It's a national problem, so don't ignore it. Find help if you need it. Get it right up front, and not only will it help you finish, it will also make your work relevant and publish-worthy.

10. Promote your work and talk to others

This advice may not seem relevant to your dissertation. However, I would argue that you should promote your work not only on your campus but also at graduate research forums, through professional organizations that host graduate research presentations, with colleagues in your research area, and through other routes. Obviously, in today's world that might mean a good online blog, too. It can give you a solid sounding board for your research and may lead to job opportunities as you move into the final stages of your dissertation.

Now go do it. Concentrate on each step and see yourself finishing that step. Success is mostly about hard work and persistence. It's what separates the "almost finished" from a job well done. Nothing, in my experience, can take the place of stick-to-itiveness. Good luck!

1. Bio-IT World

Boston, Massachusetts, United States About Blog It covers the application of informatics, IT and computer science in biomedical research and drug discovery. As the life sciences become an increasingly quantitative discipline, Bio-IT World provides topical news coverage and analysis of cutting-edge technologies to handle the data deluge in petascale computing and the tools to deliver individualized medicine. Frequency 3 posts / week Blog
Facebook fans 6.2K ⋅ Twitter followers 15.1K ⋅ Social Engagement 1 ⋅ Domain Authority 58

2. BMC Bioinformatics

London, England, United Kingdom About Blog BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. Frequency 11 posts / week Also in Bioinformatics Magazines Blog bmcbioinformatics.biomedcent..
Facebook fans 85.5K ⋅ Twitter followers 19.2K ⋅ Domain Authority 88 ⋅ Alexa Rank 2.3K

3. RNA-Seq Blog | Transcriptome Research & Industry News

United States About Blog RNA-seq, also called 'Whole Transcriptome Shotgun Sequencing' refers to the use of high-throughput sequencing technologies to sequence cDNA in order to get information about a sample's RNA content. Frequency 10 posts / week Blog
Facebook fans 4.6K ⋅ Twitter followers 7.2K ⋅ Domain Authority 45 ⋅ Alexa Rank 764.8K

4. - Computational biology and bioinformatics

London, England, United Kingdom About Blog Latest news and research on the topic of computational biology and bioinformatics. Nature Research is part of Springer Nature, a leading global research, educational and professional publisher. Springer Nature is the world's largest academic book publisher, publisher of the world's most influential journals and a pioneer in the field of open research. Frequency 30 posts / week Blog
Facebook fans 1M ⋅ Twitter followers 2.1M ⋅ Domain Authority 93

5. PLOS Computational Biology

San Francisco, California, United States About Blog By making connections through the application of computational methods among disparate areas of biology, PLOS Computational Biology provides substantial new insight into living systems at all scales, from the nano to the macro, and across multiple disciplines, from molecular science, neuroscience and physiology to ecology and population biology. Frequency 19 posts / week Blog
Facebook fans 98.7K ⋅ Twitter followers 128.9K ⋅ Domain Authority 92

6. News-Medical.Net - Bioinformatics News

United Kingdom About Blog News-Medical.Net aims to segment, profile and distribute medical news to the widest possible audience of potential beneficiaries worldwide and to provide a forum for ideas, debate and learning, and to facilitate interaction between all parts of the medical health sciences community worldwide. Frequency 9 posts / week Blog
Facebook fans 268.7K ⋅ Twitter followers 14K ⋅ Domain Authority 77 ⋅ Alexa Rank 9.9K

7. BioInformatics Inc.

Washington, District of Columbia, United States About Blog Providing critical market intelligence to major suppliers serving the life science, analytical instrument, and clinical diagnostic markets. BioInformatics Inc's blog includes key findings from our scientific industry experts on everything from company acquisitions to customer relations. Frequency 1 post / quarter Blog
Twitter followers 3.5K ⋅ Domain Authority 36 ⋅ Alexa Rank 927.7K

8. BaseSpace Informatics

San Diego, California, United States About Blog The BaseSpace Informatics Suite is a fully integrated, cloud-based informatics platform that unifies key functionality for quickly delivering high-quality genomic information. Frequency 2 posts / week Blog
Twitter followers 1.3K ⋅ Social Engagement 21 ⋅ Domain Authority 66 ⋅ Alexa Rank 60.7K

9. Creative Diagnostics | Antibodies, Antigens, Elisa Kits for Life Science

New York, United States About Blog Creative Diagnostics manufactures and markets worldwide the highest quality innovative specialty immunoassays. Fully-automated and semi-automated system options are available utilizing advanced direct label technology to meet the throughput needs of both large and small independent and hospital laboratories. Frequency 2 posts / month Blog
Facebook fans 1.4K ⋅ Twitter followers 606 ⋅ Domain Authority 32

10. Lifebit Blog

London, England, United Kingdom About Blog Hello. We blog genomics, personalized medicine, cloud, big data & AI. Welcome. Frequency 1 post / week Blog
Facebook fans 154 ⋅ Twitter followers 4.1K ⋅ Instagram Followers 118 ⋅ Domain Authority 30 ⋅ Alexa Rank 1.3M

11. Seven Bridges Genomics - The biomedical data analysis company

Boston, Massachusetts, United States About Blog Seven Bridges is the biomedical data analysis company accelerating breakthroughs in genomics research for cancer, drug development and precision medicine. Frequency 1 post / quarter Blog
Domain Authority 41 ⋅ Alexa Rank 696.8K

12. Dave Tang's blog

Perth, Western Australia, Australia About Blog I'm a genomics researcher who uses R and Perl (because I was exposed to bioinformatics around 2005). Most of this blog is on analyses related to genomics. Frequency 2 posts / quarter Blog
Twitter followers 1.7K ⋅ Social Engagement 2 ⋅ Domain Authority 30

13. OBF News | Open Source Bioinformatics news

United States About Blog The Open Bioinformatics Foundation or OBF is a non profit, volunteer run organization focused on supporting open source programming in bioinformatics. Frequency 1 post / week Blog
Twitter followers 1.8K ⋅ Domain Authority 47

14. Front Line Genomics | Bioinformatics

London, England, United Kingdom About Blog Our aim is to bring the benefits of genomics to patients faster by supporting scientists, clinicians, business/research leaders and officials. Frequency 10 posts / year Blog
Facebook fans 1.9K ⋅ Twitter followers 8.3K ⋅ Domain Authority 51 ⋅ Alexa Rank 1.8M

15. Elucidata

Cambridge, Massachusetts, United States About Blog Elucidata's mission is to use data analytics to transform decision-making processes in R&D labs in biotechnology and pharmaceutical companies. On our Blog, you will find easy-to-understand and actionable insights to help your company improve its data management. Frequency 1 post / month Also in Data Management Blogs Blog
Facebook fans 1.4K ⋅ Twitter followers 231 ⋅ Social Engagement 1 ⋅ Domain Authority 28 ⋅ Alexa Rank 594.5K

16. Bioinformatics Review

Delhi, India About Blog Bioinformatics Review provides latest bioinformatics research news and articles. Get the latest news, views and opinion at your fingertips. Frequency 1 post / day Also in Delhi Blogs Blog
Facebook fans 3.4K ⋅ Twitter followers 263 ⋅ Social Engagement 2 ⋅ Domain Authority 18 ⋅ Alexa Rank 2.2M

17. T-BioInfo in Education - News and Updates in Bioinformatic Education

New Orleans, Louisiana, United States About Blog Follow our blog for the latest news, updates, and press from Tbio. Access our online courses to begin your bioinformatics education! Frequency 25 posts / year Blog
Facebook fans 9.6K ⋅ Twitter followers 984 ⋅ Social Engagement 3 ⋅ Domain Authority 22 ⋅ Alexa Rank 1.9M

18. Living in an Ivory Basement

About Blog Gives information on science, testing and programming. Frequency 2 posts / month Blog
Domain Authority 50 ⋅ Alexa Rank 2.1M

19. Bits of DNA

About Blog Reviews and commentary on computational biology by Lior Pachter. Frequency 1 post / quarter Blog
Twitter followers 24.6K ⋅ Social Engagement 143 ⋅ Domain Authority 45 ⋅ Alexa Rank 3.4M

20. Python for Bioinformatics - adventures in bioinformatics

South Carolina, United States About Blog I teach and do research in Microbiology. This blog started as a record of my adventures learning bioinformatics and using Python. It has expanded to include Cocoa, R, simple math and assorted topics. Frequency 2 posts / month Blog
Domain Authority 18 ⋅ Alexa Rank 7.5M

21. Omics! Omics!

Cambridge, Massachusetts, United States About Blog A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery. Frequency 3 posts / month Blog
Twitter followers 7.8K ⋅ Social Engagement 12 ⋅ Domain Authority 37

22. Informatics Professor

Portland, Oregon, United States About Blog This blog maintains the thoughts on various topics related to biomedical and health informatics by Dr. William Hersh, Professor and Chair, Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University. Frequency 1 post / month Blog informaticsprofessor.blogspo..
Twitter followers 2.9K ⋅ Social Engagement 8 ⋅ Domain Authority 33

23. Fios Genomics

Edinburgh, Scotland, United Kingdom About Blog Fios Genomics is a bioinformatics provider of data analysis services to Pharma, CROs and academia for drug discovery & development and applied research across all species. Frequency 2 posts / week Blog
Facebook fans 60 ⋅ Twitter followers 1.2K ⋅ Domain Authority 25

24. Michael S. Chimenti's Bioinformatics Blog

Iowa City, Iowa, United States About Blog A blog about bioinformatics, genomic research, and data analysis. Frequency 1 post / week Blog
Domain Authority 12

25. Epistasis Blog - From the Computational Genetics Laboratory at the University of Pennsylvania

Philadelphia, Pennsylvania, United States About Blog Edward Rose Professor of Informatics, Director of the Institute for Biomedical Informatics, Director of the Division of Informatics in the Department of Biostatistics and Epidemiology, Senior Associate Dean for Informatics, The Perelman School of Medicine, University of Pennsylvania. Frequency 1 post / quarter Blog
Twitter followers 30.2K ⋅ Domain Authority 21

26. Pine Biotech

New Orleans, Louisiana, United States About Blog Pine Biotech was established to bring computational big-data analysis tools to the Agrotech and Biotech industries to address industry-specific challenges. Here you can read the latest news and publications from Pine Biotech. Frequency 1 post / month Blog
Facebook fans 9.6K ⋅ Twitter followers 984 ⋅ Social Engagement 17 ⋅ Domain Authority 28

27. Genomics Proteomics and Bioinformatics

About Blog Bioinformatics content curated by top Bioinformatics Influencers. Frequency 24 posts / quarter Blog
Domain Authority 22

28. Omics Tutorials

Australia About Blog Bioinformatics, Genomics, Proteomics, and Transcriptomics. The information contained in this website is for teaching and learning purposes only. Frequency 8 posts / year Blog
Domain Authority 4

29. Kevin's GATTACA World

About Blog My Weblog on Bioinformatics, Genome Science and Next Generation Sequencing. Frequency 1 post / quarter Blog
Domain Authority 22

30. What You're Doing Is Rather Desperate

Sydney, New South Wales, Australia About Blog I'm Neil Saunders, a data scientist and bioinformatician based in Sydney, Australia. I've worked as a data scientist for a healthcare technology startup, as a researcher for the CSIRO and in several universities. Frequency 1 post / year Blog
Twitter followers 4.6K ⋅ Social Engagement 14 ⋅ Domain Authority 44 ⋅ Alexa Rank 8.8M

31. Thermo Fisher Scientific | Connected Lab

About Blog Thermo Fisher Scientific Inc. is the world leader in serving science. Our mission is to enable our customers to make the world healthier, cleaner and safer. The Connected Lab focuses on the digital transformation of the lab via technology & software enabling lab automation, lab integration, and lab optimization. Frequency 1 post / week Blog
Facebook fans 116K ⋅ Twitter followers 57K ⋅ Domain Authority 83 ⋅ Alexa Rank 7.4K
