Map gene IDs to Ensembl gene ID


Forgive me if this question is too trivial.

I have the gene IDs of the following type

EOG6STSR2 EOG60ZRJB EOG6SBFJ2 EOG6P5KX3 EOG6B5PRW

from the first supplementary file in "Comparative validation of the D. melanogaster modENCODE transcriptome annotation".

I'm not sure as to the type of the IDs. I learned that IDs beginning with EOG are Eukaryotic Orthologous Group IDs but I couldn't find these particular ones in any of the databases.

Is it possible to map them to Ensembl IDs or any of the other commonly used ones?


These are not gene IDs but groups of orthologous genes. Searching for the IDs on Google leads to this page: http://cegg.unige.ch/orthodb6/fasta.fasta?ogs=EOG6STSR2&swaptree= so you should be able to download the sequences and IDs from OrthoDB.

Note that, strictly speaking, there are no Ensembl IDs for Drosophila, only FlyBase IDs that are also used by Ensembl.


Identifier Mapping

This protocol will show you how to map or translate identifiers from one database (e.g., Ensembl) to another (e.g., Entrez Gene). This is a common requirement for data analysis. In the context of Cytoscape, for example, identifier mapping is needed when you want to import data to overlay on a network but the keys in the data don't match those in the network. This protocol includes two distinct examples highlighting different lessons that may apply to your use case: species-specific mapping and protein-to-gene mapping.
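The key-mismatch problem described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Cytoscape's ID Mapper works: the lookup table, the function name `translate_keys`, the expression values, and the placeholder ID `ENSG00000000000` are all assumptions made up for the example.

```python
# Small illustrative lookup table: Ensembl gene ID -> Entrez Gene ID.
# In practice this would come from a mapping resource, not be hard-coded.
ENSEMBL_TO_ENTREZ = {
    "ENSG00000141510": "7157",  # TP53
    "ENSG00000133703": "3845",  # KRAS
}


def translate_keys(data, id_map):
    """Re-key `data` with `id_map`; return (translated, unmapped_keys)."""
    translated, unmapped = {}, []
    for key, value in data.items():
        target = id_map.get(key)
        if target is None:
            unmapped.append(key)  # keep track of what could not be mapped
        else:
            translated[target] = value
    return translated, unmapped


# Hypothetical data keyed by Ensembl IDs, and a network keyed by Entrez IDs.
expression = {
    "ENSG00000141510": 2.4,
    "ENSG00000133703": -1.1,
    "ENSG00000000000": 0.3,  # placeholder ID with no mapping
}
network_nodes = {"7157", "3845", "1956"}

translated, unmapped = translate_keys(expression, ENSEMBL_TO_ENTREZ)
# Overlay: every network node gets a value, or None if no data matched it.
overlay = {node: translated.get(node) for node in sorted(network_nodes)}
print(overlay)
print("unmapped:", unmapped)
```

Reporting the unmapped keys separately makes it easy to see how much of the data was lost in translation.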

Detailed information about the Cytoscape ID Mapper tool is available in Identifier Mapping in Cytoscape: idmapper (F1000 Research)


PyEnsembl also allows arbitrary genomes via the specification of local file paths or remote URLs to both Ensembl and non-Ensembl GTF and FASTA files. (Warning: GTF formats can vary, and handling of non-Ensembl data is still very much in development.)

The EnsemblRelease object has methods to let you access all possible combinations of the annotation features gene_name, gene_id, transcript_name, transcript_id, exon_id as well as the location of these genomic elements (contig, start position, end position, strand).


Gene ID mapping using R

Converting between gene ID types is a prerequisite for most genomic and proteomic data analysis. Many tools are available, each with its own drawbacks. While performing enrichment analysis on mass spectrometry datasets, I always struggled to prepare the input files required by each R package; it takes some data tweaking and cleanup before the tools accept them. The struggle is greater with UniProt IDs, as very few applications accept them as input. UniProt provides the Retrieve/ID mapping function, but it does not preserve the number of rows: any protein or gene ID that cannot be mapped is simply omitted from the output file. This makes combining datasets difficult.

Here I lay out a few R packages that I have used and that worked smoothly.

Use the org.Hs.eg.db package for human and the org.Mm.eg.db package for mouse. mapIds() accepts input types such as UniProt ID, HGNC symbol, Ensembl ID, and Entrez ID and converts between them.

mapIds() returns a named vector of IDs.

The output can be merged with the original dataset using `cbind` for further downstream analysis. One advantage I have noticed with mapIds() is that it matches the IDs row by row and inserts NA when it can't find a gene name or symbol for a given UniProt ID. This is a lifesaver when working with large datasets.
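The row-by-row behaviour described above (one output per input, with a missing-value marker where no match exists) can be sketched in Python. This is not the mapIds() implementation; the lookup table is a tiny illustrative one, `map_ids_rowwise` is a hypothetical helper, `X00000` is a made-up accession, and Python's `None` stands in for R's NA.

```python
# Illustrative lookup table: UniProt accession -> gene symbol.
UNIPROT_TO_SYMBOL = {
    "P04637": "TP53",
    "P01116": "KRAS",
}


def map_ids_rowwise(keys, lookup):
    """Return one result per input key, in input order; None if unmapped."""
    return [lookup.get(key) for key in keys]


# "X00000" is a made-up accession that has no entry in the table.
accessions = ["P01116", "X00000", "P04637"]
symbols = map_ids_rowwise(accessions, UNIPROT_TO_SYMBOL)

# Because lengths and order match, the result can sit next to the input
# column, much like binding a new column onto the original table.
table = list(zip(accessions, symbols))
for accession, symbol in table:
    print(accession, symbol)
```

Since the output has exactly as many elements as the input, no join step is needed to recombine it with the original rows.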

When querying gene symbols, use hgnc_symbol for human and mgi_symbol for mouse.

Generally, with biomaRt, extra work is required after you perform the initial mapping. You will note that biomaRt does not even return the genes in the same order in which they were submitted.
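The "extra work" mentioned above usually amounts to re-aligning the returned rows with the submitted IDs. Here is a minimal Python sketch of that step, assuming the service returns (query ID, symbol) pairs in arbitrary order, drops unmapped IDs, and may return several rows per ID. The helper `realign` and all IDs and symbols are hypothetical.

```python
def realign(submitted, returned_rows):
    """Put mapping results back into submission order.

    `returned_rows` is an iterable of (query_id, symbol) pairs in any order.
    Unmapped IDs get None; for repeated IDs the first row seen is kept.
    """
    first_hit = {}
    for query_id, symbol in returned_rows:
        first_hit.setdefault(query_id, symbol)
    return [(query_id, first_hit.get(query_id)) for query_id in submitted]


# Hypothetical submission and a shuffled, incomplete, partly duplicated reply.
submitted = ["ID_1", "ID_2", "ID_3", "ID_4"]
returned = [("ID_3", "SYM_C"), ("ID_1", "SYM_A"), ("ID_3", "SYM_C2"), ("ID_4", "SYM_D")]

aligned = realign(submitted, returned)
for query_id, symbol in aligned:
    print(query_id, symbol)
```

Keeping only the first hit is a simplification; depending on the analysis you may prefer to keep all hits or to flag ambiguous IDs.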

The clusterProfiler package was developed by Guangchuang Yu for statistical analysis and visualization of functional profiles for genes and gene clusters. Use the org.Hs.eg.db package for human and the org.Mm.eg.db package for mouse. The available key types can be listed by typing keytypes(org.Mm.eg.db).

Apart from the R functions listed above, web tools such as DAVID and the UCSC gene ID converter offer gene ID conversion for non-programmers.


Is Ensembl ID to Gene Symbol mapping platform-specific?

I have a GEO RNA-Seq dataset but its platform annotation data is missing. I want to map its ENSG IDs to gene symbols, e.g. "ENSG00000223972.5" and "ENSG00000078808.16". Is it possible to map these to gene symbols accurately without platform-specific annotations?

FWIW, the platform is GPL11154 and the dataset is GSE107011

Which software environment will you be using? In R you could use AnnotationDbi coupled with org.Hs.eg.db to map the Ensembl IDs back to gene symbols. First you would need to remove the version suffix at the end of each Ensembl ID (the dot followed by digits, e.g. ".5" or ".16"), which represents the version of the gene. Next, you can map the IDs from Ensembl to gene symbols.

Assuming you work in R and that you have performed the DE analysis and stored the results in an object called degs , my code would look like this:

Note that some Ensembl IDs might have several gene symbols (which is also why it is not generally advisable to use gene symbols as gene identifiers before running the analysis). The code above maps it back to the first one only.

Pretty sure something similar would be possible in Python or other languages. The only thing you would need is a database containing both the gene symbol and the Ensembl IDs (Biomart is typically a good choice).
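For what it's worth, a rough Python sketch of the two steps (strip the version suffix, then look up the symbol and keep only the first one) could look like the following. It is not the R code referred to above. The lookup table is a toy stand-in for a real database export, the symbols in it should be checked against a current annotation, and `ENSG00000000001` with `SYMBOL_A`/`SYMBOL_B` is a made-up entry to show the multi-symbol case.

```python
import re

# Matches a trailing ".<digits>" version suffix such as ".5" or ".16".
_VERSION_SUFFIX = re.compile(r"\.\d+$")

# Toy lookup table: unversioned Ensembl gene ID -> list of gene symbols.
ENSEMBL_TO_SYMBOLS = {
    "ENSG00000223972": ["DDX11L1"],
    "ENSG00000078808": ["SDF4"],
    "ENSG00000000001": ["SYMBOL_A", "SYMBOL_B"],  # made-up multi-symbol case
}


def strip_version(ensembl_id):
    """Remove the version suffix, e.g. 'ENSG00000078808.16' -> 'ENSG00000078808'."""
    return _VERSION_SUFFIX.sub("", ensembl_id)


def first_symbol(ensembl_id, lookup):
    """Return the first symbol for an ID (versioned or not), or None."""
    symbols = lookup.get(strip_version(ensembl_id), [])
    return symbols[0] if symbols else None


ids = ["ENSG00000223972.5", "ENSG00000078808.16", "ENSG00000000001.1", "ENSG99999999999.2"]
mapped = [first_symbol(i, ENSEMBL_TO_SYMBOLS) for i in ids]
print(list(zip(ids, mapped)))
```

Anchoring the pattern at the end of the string with any number of digits handles multi-digit versions like ".16" as well as single-digit ones.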


Tools for conversion of probe IDs

Whichever tool you use, remember to take note of the underlying mapping of probes to bioentities (i.e. transcripts/genes/proteins) that is used. While probe sequences don’t change, genome assemblies (e.g. chromosomal sequences) and annotation of bioentities are both subject to change over time. You may find that a certain probe which mapped to gene X six months ago is now mapped to gene Y because gene X has been made obsolete, or its exon-intron structure has changed in light of new supporting evidence.
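One way to see how much a probe annotation has drifted is to compare two snapshots of the probe-to-gene mapping. The sketch below assumes each snapshot is available as a simple dictionary; the function `diff_probe_annotations`, the gene names, and the second probe ID are made up for illustration.

```python
def diff_probe_annotations(old, new):
    """Return {probe: (old_gene, new_gene)} for probes whose mapping changed.

    A probe missing from one snapshot is reported with None on that side.
    """
    changes = {}
    for probe in sorted(set(old) | set(new)):
        before, after = old.get(probe), new.get(probe)
        if before != after:
            changes[probe] = (before, after)
    return changes


# Hypothetical snapshots of the same platform annotation at two dates.
annotation_old = {"A_14_P109686": "GENE_X", "PROBE_2": "GENE_Z"}
annotation_new = {"A_14_P109686": "GENE_Y", "PROBE_2": "GENE_Z", "PROBE_3": "GENE_W"}

changes = diff_probe_annotations(annotation_old, annotation_new)
for probe, (before, after) in changes.items():
    print(probe, before, "->", after)
```

Recording which annotation version was used alongside the results makes this kind of comparison possible later.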

If you have a small list of probe IDs, you can use the conversion tool in the Ensembl Genome Browser. For some common microarray platforms (Affymetrix, Agilent and Illumina), Ensembl regularly maps the probes/probe sets against the latest set of transcript models. To search, simply use individual probe identifiers as search terms in Ensembl (e.g. Agilent probe ID A_14_P109686). Alternatively, different web tools offer probe conversion, such as DAVID.

If you have a long list of probe IDs, R/Bioconductor offers a range of annotation packages that can be used to convert probe IDs during the microarray analysis workflow.


GeneWalk determines for individual genes the functions that are relevant in a particular biological context and experimental condition. GeneWalk quantifies the similarity between vector representations of a gene and annotated GO terms through representation learning with random walks on a condition-specific gene regulatory network. Similarity significance is determined through comparison with node similarities from randomized networks.

To install the latest release of GeneWalk (preferred):

To install the latest code from Github (typically ahead of releases):

GeneWalk uses a number of resource files that it downloads as needed during runtime. To optionally pre-download these resource files in the default resource folder, the command

GeneWalk always requires as input a text file containing a list of genes of interest relevant to the biological context, for example differentially expressed genes from a sequencing experiment that compares an experimental versus a control condition. GeneWalk supports gene list files containing HGNC human gene symbols, HGNC IDs, human Ensembl gene IDs, MGI mouse gene IDs, RGD rat gene IDs, or human or mouse Entrez IDs. GeneWalk internally maps these IDs to human genes.

For organisms other than human, mouse or rat, there are two options. The first is to map the genes to human orthologs yourself and then input the human ortholog list as described above. Use this strategy if you consider the organism sufficiently related to human. The second option is to provide an input gene file with custom gene IDs. These are not mapped to human genes. Use custom gene IDs for more divergent organisms, such as Drosophila, worm, yeast, plants or bacteria. In this case the user must also provide a custom gene network with GO annotations as input. See section Custom input networks for more details.

Each line in the gene input file contains a gene identifier of one of the above types.

GeneWalk command line interface

Once installed, GeneWalk can be run from the command line as genewalk, with a set of required and optional arguments. The required arguments include the project name, a path to a text file containing a list of genes, and an argument specifying the type of gene identifiers in the file.
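As a rough illustration of the three required arguments, the Python sketch below assembles a command line without running it. The flag names `--project`, `--genes` and `--nproc` are assumptions that are not spelled out in this text (only `--id_type` is), so check them against `genewalk --help`; the helper function, project name and file name are hypothetical.

```python
import shlex


def build_genewalk_command(project, genes_file, id_type, nproc=None):
    """Assemble (but do not run) a genewalk command line as a list of tokens.

    Flag names other than --id_type are assumptions; verify with --help.
    """
    cmd = ["genewalk", "--project", project, "--genes", genes_file, "--id_type", id_type]
    if nproc is not None:
        cmd += ["--nproc", str(nproc)]
    return cmd


# Hypothetical project name and gene list file.
cmd = build_genewalk_command("context1", "gene_list.txt", "hgnc_symbol", nproc=4)
print(shlex.join(cmd))
```

Building the command as a token list keeps file paths with spaces intact if the list is later passed to `subprocess.run`.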

Below is the full documentation of the command line interface:

GeneWalk automatically creates a genewalk folder in the user's home folder (or the user specified base_folder). When running GeneWalk, one of the required inputs is a project name. A sub-folder is created for the given project name where all intermediate and final results are stored. The files stored in the project folder are:

  • genewalk_results.csv - The main results table, a comma-separated values text file. See below for detailed description.
  • genes.pkl - A processed representation of the given gene list, in Python pickle (.pkl) binary file format.
  • multi_graph.pkl - A networkx MultiGraph representing the GeneWalk network which was assembled based on the given list of genes, an interaction network, GO annotations, and the GO ontology.
  • deepwalk_node_vectors_*.pkl - A set of learned node vectors for each analysis repeat for the graph.
  • deepwalk_node_vectors_rand_*.pkl - A set of learned node vectors for each analysis repeat for a random graph.
  • genewalk_rand_simdists.pkl - Distributions constructed from repeats.
  • deepwalk_*.pkl - A DeepWalk object for each analysis repeat on the graph (only present if save_dw argument is set to True).
  • deepwalk_rand_*.pkl - A DeepWalk object for each analysis repeat on a random graph (only present if save_dw argument is set to True).

GeneWalk also automatically generates figures to visualize its results in the project/figures sub-folder:

  • index.html : an HTML page that includes all the figures generated, as described below.
  • barplots with GO annotations ranked by relevance for each input gene that GeneWalk was able to generate results for. The filenames contain the corresponding human gene symbol and input gene id: barplot_[symbol]_[gene id]_x_mlog10global_padj_y_GO.png .
  • regulators_x_gene_con_y_frac_rel_go(.png and .pdf) : scatter plot to identify regulator genes of interest. These have a large gene connectivity and high fraction of relevant GO annotations. For more information see our publication.
  • genewalk_regulators.csv : list with regulator genes that are named in the regulators scatterplot.
  • moonlighters_x_go_con_y_frac_rel_go(.png and .pdf) : scatter plot to identify moonlighting genes: genes with many GO annotations of which a low fraction are relevant. For more information see our publication.
  • genewalk_moonlighters.csv : list with moonlighting genes that are named in the moonlighting scatterplot.
  • genewalk_scatterplots.csv : data corresponding to the regulator and moonlighter scatter plots. This file can be used for further gene prioritization analyses.

GeneWalk results file description

genewalk_results.csv is the main GeneWalk output table, a comma-separated values text file with the following column headers:

  • hgnc_id - human gene HGNC identifier.
  • hgnc_symbol - human gene symbol.
  • go_name - GO term name.
  • go_id - GO term identifier.
  • go_domain - Ontology domain that GO term belongs to (biological process, cellular component or molecular function).
  • ncon_gene - number of connections to gene in GeneWalk network.
  • ncon_go - number of connections to GO term in GeneWalk network.
  • global_padj - false discovery rate (FDR) adjusted p-value of the similarity between gene and GO term, when correcting for testing over all gene-GO term pairs present in the output file. This is the key statistic that indicates how relevant the gene-GO term pair (gene function) is in the particular biological context or tested condition. Global_padj should be used for global analyses that consider all the GeneWalk output simultaneously, such as gene prioritization procedures. GeneWalk determines an adjusted p-value with Benjamini-Hochberg FDR correction for multiple testing of all connected GO terms for each nreps_graph repeat analysis. The value presented here is the average (mean estimate) over all p-adjust values from all nreps_graph repeat analyses.
  • gene_padj - FDR adjusted p-value of the similarity between gene and GO term, when correcting for multiple testing over all GO annotations of that gene. This is the key statistic when investigating the functions of one (or a few) pre-defined gene(s) of interest. Gene_padj determines the statistical significance of each GO annotation (function) and can be used to sensitively rank GO annotations to reflect their relevance to the gene of interest in the particular biological context or tested condition. When you consider all (or many) input genes simultaneously, use global_padj instead. Average over nreps_graph repeat runs as for global_padj.
  • pval - p-value of gene - GO term similarity, not corrected for multiple hypothesis testing. Average over nreps_graph repeat runs.
  • sim - gene - GO term (cosine) similarity, average over nreps_graph repeat runs.
  • sem_sim - standard error on sim (mean estimate).
  • cilow_global_padj - lower bound of 95% confidence interval on global_padj (mean estimate) from the nreps_graph repeat analyses.
  • ciupp_global_padj - upper bound of 95% confidence interval on global_padj.
  • cilow_gene_padj - lower bound of 95% confidence interval on gene_padj (mean estimate) from the nreps_graph repeat analyses.
  • ciupp_gene_padj - upper bound of 95% confidence interval on gene_padj.
  • cilow_pval - lower bound of 95% confidence interval on pval (mean estimate) from the nreps_graph repeat analyses.
  • ciupp_pval - upper bound of 95% confidence interval on pval.
  • mgi_id, rgd_id, ensembl_id, entrez_human or entrez_mouse - if one of these gene identifier types was provided as input, the GeneWalk results table starts with an additional column containing the input gene identifiers. In the case of mouse genes, the corresponding hgnc_id and hgnc_symbol represent the human ortholog gene used for the GeneWalk analysis.
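To illustrate how the columns above might be used, here is a small Python sketch that filters a results table on global_padj and sorts the surviving gene-GO pairs. The rows are toy values made up for the example (only a subset of columns is shown), `relevant_pairs` is a hypothetical helper, and the 0.1 threshold is an assumption rather than a recommendation from this text.

```python
import csv
import io

# Toy stand-in for genewalk_results.csv with a subset of its columns.
TOY_RESULTS = """hgnc_symbol,go_name,go_id,global_padj,gene_padj
GENE1,example term A,GO:0000001,0.02,0.01
GENE1,example term B,GO:0000002,0.40,0.20
GENE2,example term C,GO:0000003,0.08,0.03
"""


def relevant_pairs(handle, column="global_padj", threshold=0.1):
    """Return rows whose adjusted p-value is below `threshold`, smallest first."""
    rows = [row for row in csv.DictReader(handle) if float(row[column]) < threshold]
    return sorted(rows, key=lambda row: float(row[column]))


hits = relevant_pairs(io.StringIO(TOY_RESULTS))
for row in hits:
    print(row["hgnc_symbol"], row["go_id"], row["global_padj"])
```

Passing `column="gene_padj"` instead would apply the same filter for the single-gene use case described above. For a real run, replace the in-memory text with an open file handle on genewalk_results.csv.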

Run time and stages of GeneWalk algorithm

Recommended number of processors (optional argument: nproc) for a short (1-2h) run time is 4:

By default GeneWalk will run with 1 processor, resulting in a longer overall run time: 6-12h. Given a list of genes, GeneWalk runs four stages of analysis:

  1. Assembling a GeneWalk network and learning node vector representations by running DeepWalk on this network, for a specified number of repeats. Typical run time: one to a few hours.
  2. Learning random node vector representations by running DeepWalk on a set of randomized versions of the GeneWalk network, for a specified number of repeats. Typical run time: one to a few hours.
  3. Calculating statistics of similarities between genes and GO terms, and outputting the GeneWalk results in a table. Typical run time: a few minutes.
  4. Visualization of the GeneWalk results generated in the project/figures subfolder. Typical run time: 1-10 mins depending on the number of input genes.

GeneWalk can either be run once to complete all these stages (default), or called separately for each stage (optional argument: stage). Recommended memory availability on your operating system: 16 GB or 32 GB of RAM. GeneWalk outputs the uncertainty (95% confidence intervals) of the similarity significance (global and gene p-adjust). Depending on the context-specific network topology, this uncertainty can be large for individual gene-function associations. However, if the uncertainties turn out to be very large overall, one can set the optional arguments nreps_graph to 10 (or more) and nreps_null to 10 to increase the algorithm's precision. This comes at the cost of an increased run time.

By default, GeneWalk uses the PathwayCommons resource ( --network_source pc ) to create a human gene network. It then automatically adds edges representing GO annotations for input genes and ontology relations between GO terms. However, there are options to run GeneWalk with a custom network as an input.

First, specify the --network_source argument as one of the alternative sources described below: sif, sif_annot, sif_full, edge_list, or indra.

If custom gene IDs are used ( --id_type custom ) in the input gene list, for instance from a model organism: choose as network source sif_annot or sif_full .

Then, include the argument --network_file with the path to the custom network input file. The network file format has to correspond to the chosen --network_source , as follows.

The sif/sif_annot/sif_full options require the network file in a simple interaction file (SIF) format. Each row of the SIF text file consists of three comma-separated entries representing source, relation type, and target. The relation type is not explicitly used by GeneWalk, and can be set to an arbitrary label.

The difference between the sif , sif_annot , and sif_full options:

  • sif : the input SIF can contain only human gene-gene relations. Genes have to be encoded as human HGNC gene symbols (for example KRAS). GO annotations for genes, as well as ontology relations between GO terms are added automatically by GeneWalk.
  • sif_annot : the input SIF has to contain both gene-gene relations, and GO annotations for genes: rows where the source is a gene, and the target is a GO term. Use GO IDs with prefix (for example GO:0000186) to encode GO terms. Genes should be encoded the same as in the gene input list and do not have to correspond to human genes. Ontology relations between GO terms are then added automatically by GeneWalk.
  • sif_full : the input SIF has to contain all GeneWalk network edges: gene-gene relations, GO annotations for genes, and ontology relations between GO terms. GeneWalk does not add any more edges to the network. Encode genes and GO terms in the same manner as for sif_annot .

The edge_list option is a simplified version of the sif option. It requires a network text file that contains rows with two columns each, a source and a target. In other words, it omits the relation type column from the SIF format. Further file preparation requirements are the same as for the sif option.
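The relationship between the two formats can be sketched as a small conversion that drops the relation column from each SIF row. This is not part of GeneWalk; `sif_to_edge_list` is a hypothetical helper, the gene pairs and relation labels are illustrative, and writing the edge list with the same comma delimiter as the SIF file is an assumption that should be checked against the GeneWalk documentation.

```python
import csv
import io


def sif_to_edge_list(sif_text):
    """Convert comma-separated SIF rows (source, relation, target) to (source, target) pairs."""
    edges = []
    for row in csv.reader(io.StringIO(sif_text)):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 fields, got {len(row)}: {row}")
        source, _relation, target = (field.strip() for field in row)
        edges.append((source, target))
    return edges


# Toy SIF content; the relation label is arbitrary, as noted above.
toy_sif = "KRAS,interacts_with,BRAF\nBRAF,interacts_with,MAP2K1\n"
edges = sif_to_edge_list(toy_sif)

# Delimiter for the output is an assumption (same as the SIF input).
edge_list_text = "\n".join(",".join(edge) for edge in edges)
print(edge_list_text)
```

Raising an error on malformed rows surfaces file preparation problems before a long GeneWalk run starts.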

The indra option requires as custom network input file a Python pickle file containing a list of INDRA Statements. These statements can represent human gene-gene, as well as gene-GO relations from which network edges are derived. Human GO annotations and ontology relations between GO terms are then added automatically by GeneWalk during network construction.

For a tutorial and more general information see the GeneWalk website.
For further code documentation see our readthedocs page.

Robert Ietswaart, Benjamin M. Gyori, John A. Bachman, Peter K. Sorger, and L. Stirling Churchman
GeneWalk identifies relevant gene functions for a biological context using network representation learning,
Genome Biology 22, 55 (2021). https://doi.org/10.1186/s13059-021-02264-8

This work was supported by National Institutes of Health grant 5R01HG007173-07 (L.S.C.), EMBO fellowship ALTF 2016-422 (R.I.), and DARPA grants W911NF-15-1-0544 and W911NF018-1-0124 (P.K.S.).


Usage

A character vector of Latin names of species present in this scRNA-seq dataset. This is used to retrieve Ensembl information from biomart.

Character vector of paths to the transcriptome FASTA files used to build the kallisto index. Exactly one of species and fasta_file can be missing.

Path to the kallisto bus output directory.

A character vector indicating the type of each species. Each element must be one of "vertebrate", "metazoa", "plant", "fungus", and "protist". If length is 1, then this type will be used for all species specified here. Can be missing if fasta_file is specified.

Other arguments passed to tr2g_ensembl, such as other_attrs, ensembl_version, and arguments passed to useMart. If fasta_file is supplied instead of species, then these are extra arguments to tr2g_fasta, such as use_transcript_version and use_gene_version.
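To make the FASTA route more concrete, here is a Python sketch of extracting a transcript-to-gene table from Ensembl-style cDNA FASTA headers, where the transcript ID is the first token and the gene ID follows a `gene:` tag. It is not the implementation of tr2g_fasta; `tr2g_from_headers` is a hypothetical helper, the two keyword arguments merely mirror the option names mentioned above, and the headers are made-up examples.

```python
import re

# The gene ID follows a "gene:" tag in Ensembl-style headers.
_GENE_TAG = re.compile(r"\bgene:(\S+)")


def tr2g_from_headers(headers, use_transcript_version=True, use_gene_version=True):
    """Return (transcript_id, gene_id) pairs parsed from FASTA header lines."""
    rows = []
    for header in headers:
        if not header.startswith(">"):
            continue  # not a header line
        match = _GENE_TAG.search(header)
        if match is None:
            continue  # header without a gene tag
        transcript = header[1:].split()[0]
        gene = match.group(1)
        if not use_transcript_version:
            transcript = transcript.split(".")[0]
        if not use_gene_version:
            gene = gene.split(".")[0]
        rows.append((transcript, gene))
    return rows


# Made-up headers in the Ensembl cDNA style.
headers = [
    ">ENST00000000001.2 cdna chromosome:GRCh38:1:100:200:1 gene:ENSG00000000001.5 gene_biotype:protein_coding",
    ">ENST00000000002.1 cdna chromosome:GRCh38:1:300:400:-1 gene:ENSG00000000001.5 gene_biotype:protein_coding",
]
with_versions = tr2g_from_headers(headers)
without_versions = tr2g_from_headers(headers, use_transcript_version=False, use_gene_version=False)
print(with_versions)
print(without_versions)
```

Whether versions are kept should match how transcript names appear in the kallisto index, otherwise the table will not join with the quantification output.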


And to do it using transcripts you do it like this:

The key difference is that TXSTART refers to the start of a transcript and originates in the TxDb object from the TxDb.Hsapiens.UCSC.hg19.knownGene package, while CHRLOC refers to the same thing but originates in the OrgDb object from the org.Hs.eg.db package. The point of origin is significant because the TxDb object represents a transcriptome from UCSC, whereas the OrgDb is primarily gene-centric data that originates at NCBI. The upshot is that CHRLOC will not have as many regions represented as TXSTART, since there has to be an official gene for there to even be a record. The CHRLOC data is also locked to hg19 in org.Hs.eg.db, whereas you can swap in a different TxDb object to match the genome you are using (hg18, for example). For these reasons, we strongly recommend using TXSTART instead of CHRLOC. However, CHRLOC still remains in the org packages for historical reasons.