Introduction to the msigdbr package

Overview

Performing pathway analysis is a common task in genomics and there are many available software tools, many of which are R-based. Depending on the tool, it may be necessary to import the pathways into R, translate genes to the appropriate species, convert between symbols and IDs, and format the object in the required way.

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software.

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

Load package.

library(msigdbr)

Check the available species.

msigdbr_show_species()
#>  [1] "Bos taurus"               "Caenorhabditis elegans"   "Canis lupus familiaris"  
#>  [4] "Danio rerio"              "Drosophila melanogaster"  "Gallus gallus"           
#>  [7] "Homo sapiens"             "Mus musculus"             "Rattus norvegicus"       
#> [10] "Saccharomyces cerevisiae" "Sus scrofa"

Retrieve all human gene sets.

m_df = msigdbr(species = "Homo sapiens")
head(m_df)
#> # A tibble: 6 x 9
#>   gs_name     gs_id  gs_cat gs_subcat human_gene_s… species_n… entrez_g… gene_sym… sources
#>   <chr>       <chr>  <chr>  <chr>     <chr>         <chr>          <int> <chr>     <chr>  
#> 1 AAACCAC_MI… M12609 C3     MIR       ABCC4         Homo sapi…     10257 ABCC4     <NA>   
#> 2 AAACCAC_MI… M12609 C3     MIR       ABRAXAS2      Homo sapi…     23172 ABRAXAS2  <NA>   
#> 3 AAACCAC_MI… M12609 C3     MIR       ACTN4         Homo sapi…        81 ACTN4     <NA>   
#> 4 AAACCAC_MI… M12609 C3     MIR       ACVR1         Homo sapi…        90 ACVR1     <NA>   
#> 5 AAACCAC_MI… M12609 C3     MIR       ADAM9         Homo sapi…      8754 ADAM9     <NA>   
#> 6 AAACCAC_MI… M12609 C3     MIR       ADAMTS5       Homo sapi…     11096 ADAMTS5   <NA>

Retrieve mouse hallmark collection gene sets.

m_df = msigdbr(species = "Mus musculus", category = "H")
head(m_df)
#> # A tibble: 6 x 9
#>   gs_name   gs_id gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources       
#>   <chr>     <chr> <chr>  <chr>     <chr>      <chr>         <int> <chr>     <chr>         
#> 1 HALLMARK… M5905 H      ""        ABCA1      Mus musc…     11303 Abca1     Inparanoid,Ho…
#> 2 HALLMARK… M5905 H      ""        ABCB8      Mus musc…     74610 Abcb8     Inparanoid,Ho…
#> 3 HALLMARK… M5905 H      ""        ACAA2      Mus musc…     52538 Acaa2     Inparanoid,Ho…
#> 4 HALLMARK… M5905 H      ""        ACADL      Mus musc…     11363 Acadl     Inparanoid,Ho…
#> 5 HALLMARK… M5905 H      ""        ACADM      Mus musc…     11364 Acadm     Inparanoid,Ho…
#> 6 HALLMARK… M5905 H      ""        ACADS      Mus musc…     11409 Acads     Inparanoid,Ho…

Retrieve mouse C2 (curated) CGP (chemical and genetic perturbations) gene sets.

m_df = msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
head(m_df)
#> # A tibble: 6 x 9
#>   gs_name   gs_id gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources       
#>   <chr>     <chr> <chr>  <chr>     <chr>      <chr>         <int> <chr>     <chr>         
#> 1 ABBUD_LI… M1423 C2     CGP       AHNAK      Mus musc…     66395 Ahnak     Inparanoid,Ho…
#> 2 ABBUD_LI… M1423 C2     CGP       ALCAM      Mus musc…     11658 Alcam     Inparanoid,Ho…
#> 3 ABBUD_LI… M1423 C2     CGP       ANKRD40    Mus musc…     71452 Ankrd40   Inparanoid,Ho…
#> 4 ABBUD_LI… M1423 C2     CGP       ARID1A     Mus musc…     93760 Arid1a    Inparanoid,Ho…
#> 5 ABBUD_LI… M1423 C2     CGP       BCKDHB     Mus musc…     12040 Bckdhb    Inparanoid,Ho…
#> 6 ABBUD_LI… M1423 C2     CGP       C16orf89   Mus musc…    239691 AU021092  Inparanoid,Ho…

The msigdbr() function output can also be manipulated as a standard data frame.

library(dplyr)  # provides the %>% pipe used below
m_df = msigdbr(species = "Mus musculus") %>% dplyr::filter(gs_cat == "H")
head(m_df)
#> # A tibble: 6 x 9
#>   gs_name   gs_id gs_cat gs_subcat human_gen… species_… entrez_g… gene_sym… sources       
#>   <chr>     <chr> <chr>  <chr>     <chr>      <chr>         <int> <chr>     <chr>         
#> 1 HALLMARK… M5905 H      ""        ABCA1      Mus musc…     11303 Abca1     Inparanoid,Ho…
#> 2 HALLMARK… M5905 H      ""        ABCB8      Mus musc…     74610 Abcb8     Inparanoid,Ho…
#> 3 HALLMARK… M5905 H      ""        ACAA2      Mus musc…     52538 Acaa2     Inparanoid,Ho…
#> 4 HALLMARK… M5905 H      ""        ACADL      Mus musc…     11363 Acadl     Inparanoid,Ho…
#> 5 HALLMARK… M5905 H      ""        ACADM      Mus musc…     11364 Acadm     Inparanoid,Ho…
#> 6 HALLMARK… M5905 H      ""        ACADS      Mus musc…     11409 Acads     Inparanoid,Ho…
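
Because the output has one gene per row, ordinary data frame operations apply to it. The following is a minimal sketch in base R that filters by collection and counts genes per gene set. It uses a small hand-made data frame (toy_df, a hypothetical stand-in with made-up set names) instead of real msigdbr() output, so it runs without the package.

```r
# Toy stand-in for msigdbr() output; the set names are illustrative only.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  gs_cat = c("H", "H", "H", "C2", "C2"),
  gene_symbol = c("Abca1", "Abcb8", "Acaa2", "Ahnak", "Alcam"),
  stringsAsFactors = FALSE
)

# Keep a single collection, the base R analogue of dplyr::filter(gs_cat == "H").
hallmark_df <- toy_df[toy_df$gs_cat == "H", ]

# Count the number of genes in each gene set.
set_sizes <- table(toy_df$gs_name)
print(set_sizes)
```

The same two steps should carry over to a real msigdbr() result, since they rely only on the gs_name and gs_cat columns shown in the output above.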

Integrating with Pathway Analysis Packages

Use the gene sets data frame for clusterProfiler (for genes as Entrez Gene IDs).

m_t2g = m_df %>% dplyr::select(gs_name, entrez_gene) %>% as.data.frame()
enricher(gene = gene_ids_vector, TERM2GENE = m_t2g, ...)

Use the gene sets data frame for clusterProfiler (for genes as gene symbols).

m_t2g = m_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = m_t2g, ...)

Use the gene sets data frame for fgsea.

# Named list: one character vector of gene symbols per gene set.
m_list = split(x = m_df$gene_symbol, f = m_df$gs_name)
fgsea(pathways = m_list, ...)
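
To make the list structure concrete, here is a small sketch of the same split() step applied to a hand-made data frame (toy_df, hypothetical set names). It only illustrates the shape of the resulting object and does not call fgsea itself.

```r
# Toy long-format gene sets; the set names are made up for illustration.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B"),
  gene_symbol = c("Abca1", "Abcb8", "Ahnak"),
  stringsAsFactors = FALSE
)

# Split the gene column by gene set name to get a named list of character vectors.
m_list <- split(x = toy_df$gene_symbol, f = toy_df$gs_name)
str(m_list)
```

Each list element is named after a gene set and holds that set's genes, which is the kind of object passed as pathways above.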

Questions and Concerns

Which version of MSigDB was used?

This package was generated with MSigDB v7.0 (released August 2019). The MSigDB version is used as the base of the package version. You can check the installed version with packageVersion("msigdbr").

Can I download the gene sets directly from MSigDB instead of using this package?

Yes. You can then import the GMT files (with getGmt() from the GSEABase package, for example). The GMTs only include the human genes, even for gene sets generated from mouse data. If you are not working with human data, you then have to convert the MSigDB genes to your organism or your genes to human.
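
As an alternative to getGmt(), a GMT file can also be read with base R, because each line holds a set name, a description field, and then the member genes, all separated by tabs. The sketch below writes a tiny GMT-style file with made-up content (the set names and example.org URLs are placeholders) and converts it to a long data frame. It is an illustration under those assumptions, not a replacement for a dedicated parser.

```r
# Write a tiny GMT-style file (toy content) so the example is self-contained.
gmt_file <- tempfile(fileext = ".gmt")
writeLines(
  c("SET_A\thttp://example.org/SET_A\tABCA1\tABCB8\tACAA2",
    "SET_B\thttp://example.org/SET_B\tAHNAK\tALCAM"),
  gmt_file
)

# Each line is: set name, description, then the member genes (tab-separated).
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)
gmt_list <- lapply(fields, function(x) x[-(1:2)])
names(gmt_list) <- vapply(fields, function(x) x[1], character(1))

# Convert the named list to a long layout with one gene per row.
gmt_df <- data.frame(
  gs_name = rep(names(gmt_list), lengths(gmt_list)),
  gene_symbol = unlist(gmt_list, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(gmt_df)
```

The genes read this way are still human genes, so the conversion step described above is still needed for other organisms.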

Can I convert between human and mouse genes just by adjusting gene capitalization?

That will work for most genes, but not all.
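
The example output earlier in this document already contains a counterexample: human C16orf89 maps to mouse AU021092. A short sketch of the naive capitalization approach, using three human/mouse pairs copied from that output and a hypothetical helper named to_title_case(), shows where it breaks down.

```r
# Human/mouse symbol pairs taken from the msigdbr() example output above.
symbol_pairs <- data.frame(
  human = c("ABCA1", "AHNAK", "C16orf89"),
  mouse = c("Abca1", "Ahnak", "AU021092"),
  stringsAsFactors = FALSE
)

# Naive conversion (hypothetical helper): first letter upper case, rest lower case.
to_title_case <- function(x) {
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}

# Compare the guessed mouse symbol with the actual ortholog symbol.
symbol_pairs$guess <- to_title_case(symbol_pairs$human)
symbol_pairs$matches <- symbol_pairs$guess == symbol_pairs$mouse
print(symbol_pairs)
```

The first two pairs match, while the third does not, because the mouse ortholog has an unrelated symbol.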

Can I convert human genes to any organism myself instead of using this package?

Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.

Aren’t there already other similar tools?

There are a few other resources that provide some of the functionality and served as an inspiration for this package. Ge Lab Gene Set Files has GMT files for many species. WEHI provides MSigDB gene sets in R format for human and mouse, but the genes are provided only as Entrez IDs and each collection is a separate file. MSigDF is based on the WEHI resource, so it provides the same data, but converted to a more tidyverse-friendly data frame. When msigdbr was initially released, all of them were multiple releases behind the latest version of MSigDB, so they are possibly no longer maintained.

Details

The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software.

Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
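
The selection rule in the last sentence can be sketched as follows. This is an illustration only, not the code used to build the package. The candidate table, its gene names, and its source lists are hypothetical. It assumes the supporting databases are stored as a comma-separated string, as the sources column in the example output suggests.

```r
# Hypothetical candidate orthologs; gene names and source lists are made up.
candidates <- data.frame(
  human_gene = c("GENE1", "GENE1", "GENE2"),
  ortholog = c("Gene1a", "Gene1b", "Gene2"),
  sources = c("Inparanoid,HomoloGene,OMA", "Panther", "OrthoDB,TreeFam"),
  stringsAsFactors = FALSE
)

# Count the databases listed in each comma-separated sources string.
candidates$num_sources <- lengths(strsplit(candidates$sources, ",", fixed = TRUE))

# For each human gene, keep the candidate supported by the most databases.
# which.max() returns the first maximum, so ties are resolved by row order here.
best <- do.call(rbind, lapply(split(candidates, candidates$human_gene), function(d) {
  d[which.max(d$num_sources), ]
}))
print(best)
```

How ties are actually resolved in the package is not stated in this document. Taking the first row is just the simplest choice for the sketch.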