Gene Set Enrichment Analysis (GSEA) in a Nutshell

Gene set enrichment analysis, or GSEA, is a computational technique that was originally developed for gene expression data and can also be applied to proteomic data. Using GSEA, one can investigate the molecular pathways in which a given set of proteins may be involved, and from there form hypotheses about their function. A list of protein (gene) names is fed into the software, which reports the pathways, called gene sets, that those entries overlap with.

The Java-based GSEA software was developed by a team of scientists at UC San Diego and the Broad Institute, with the goal of providing a comprehensive, data-driven method for identifying connections between biomolecular pathways. The initial database contained 1,325 biologically defined gene sets, which gives a sense of how thorough the software is.

Here is what GSEA output looks like:

An image of the raw output from the GSEA software (from a single database, WikiPathways). The vertically oriented text names a cellular process, and the horizontally oriented text names the proteins in the input data. A shaded box indicates that a given protein is involved in the designated pathway. These data come from a list of proteins found by mass spectrometry in endometrial sections. More information about this experiment can be found in my endometrial proteomics article.

One shortcoming of this method of data display is that it doesn’t account for “coincidental” participation of certain proteins in pathways. For example, suppose a given molecular pathway is very large (it contains many protein participants) and involves an arbitrary Protein X. Say Protein X also participates in a smaller pathway, in which fewer proteins are involved. The above form of data display would portray X’s contribution to both pathways as exactly equal, since a shaded box only indicates participation. However, the degree to which Protein X contributes to the large pathway’s result is very different from the degree to which it contributes to the small pathway’s. To avoid this misrepresentation and quantify how strongly our proteins are represented in each pathway, we use the k/K value.

The k/K value quantifies how much of a given cellular process is covered by the entered protein list. k is the number of genes shared between the entered protein list and the pathway (gene set) identified by the software. K is the total number of genes in that gene set. By calculating k/K, then, we are quantifying the fraction of each pathway that is accounted for by the members of our protein list.
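To make the definition concrete, here is a minimal Python sketch of the k/K calculation for a single gene set. The function name overlap_ratio and all of the gene lists are made up for illustration (they are not real GSEA output or part of the GSEA software), and the sketch assumes that names are matched as case-insensitive symbols.

```python
def overlap_ratio(protein_list, gene_set):
    """Return (k, K, k/K) for one gene set.

    k = number of names shared by the input list and the gene set
    K = total number of names in the gene set
    """
    # Assumption: names are compared as case-insensitive symbols.
    query = {name.upper() for name in protein_list}
    members = {name.upper() for name in gene_set}
    k = len(query & members)
    K = len(members)
    return k, K, (k / K if K else 0.0)


# Hypothetical input list and two hypothetical pathways.
my_proteins = ["ACTB", "VIM", "ANXA2", "HSPA5"]
small_pathway = ["ACTB", "VIM", "TLN1", "VCL"]
large_pathway = ["ACTB"] + [f"GENE{i}" for i in range(1, 100)]

print(overlap_ratio(my_proteins, small_pathway))  # (2, 4, 0.5)
print(overlap_ratio(my_proteins, large_pathway))  # (1, 100, 0.01)
```

In this toy example ACTB would get a shaded box for both pathways, but the k/K values (0.5 versus 0.01) show that the input list covers far more of the small pathway than of the large one.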

Simple manipulation of the GSEA data allows us to generate a graph of k/K for the various gene sets identified by the software. This gives an idea of which pathways are most strongly represented among the proteins entered, allowing us to speculate about potential functions of proteins in a certain tissue type. (For an example of this kind of analysis, visit my endometrial proteomics article!)

Here is what a k/K graph looks like:

This graph was made in Excel using the k/K values from the GSEA output.
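For anyone who would rather skip the spreadsheet step, the same kind of ranking can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pathway names, protein names, and the helper rank_by_overlap are all hypothetical, and the "graph" is a simple text bar per gene set rather than a real chart.

```python
def rank_by_overlap(protein_list, gene_sets):
    """Return (name, k, K, k/K) tuples sorted from highest to lowest k/K."""
    query = set(protein_list)
    rows = []
    for name, members in gene_sets.items():
        members = set(members)
        k = len(query & members)
        K = len(members)
        if k > 0:  # skip gene sets with no overlap at all
            rows.append((name, k, K, k / K))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical gene sets and input list, for illustration only.
gene_sets = {
    "PATHWAY_A": ["P1", "P2", "P3", "P4"],
    "PATHWAY_B": ["P1", "P5", "P6", "P7", "P8",
                  "P9", "P10", "P11", "P12", "P13"],
    "PATHWAY_C": ["P20", "P21"],
}
my_proteins = ["P1", "P2", "P3", "P5"]

ranked = rank_by_overlap(my_proteins, gene_sets)
for name, k, K, ratio in ranked:
    bar = "#" * round(ratio * 20)  # crude text bar, 20 characters = k/K of 1
    print(f"{name:<10} k={k:<2} K={K:<3} k/K={ratio:.2f} {bar}")
```

Sorting by k/K puts the gene sets with the highest coverage at the top, which is the same information the bar graph conveys.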

I think GSEA is a great tool for forming hypotheses based on preliminary proteomic data. I’m not sure this kind of data can support firm conclusions on its own, but I think it’s fun to play around, see what’s connected to what, and ponder why.

That said, here are some helpful links about this topic:

http://www.gsea-msigdb.org/gsea/index.jsp

https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html

http://www.gsea-msigdb.org/gsea/msigdb/help_annotations.jsp

The above three links are from the same main website (the official GSEA page), but I included them separately because I felt that these pages in particular are the most helpful for gaining a basic understanding of how the software works.

———

https://www.pnas.org/doi/10.1073/pnas.0506580102 - This is a paper from the creators of GSEA, providing “a full mathematical description of the GSEA methodology” with examples of its use. Enjoy!

https://bioinformaticshome.com/tools/rna-seq/descriptions/GSEA-UCSD.html - Basic facts about GSEA.
