Differential expression analysis

There are many software packages for differential expression analysis of RNA-seq data. We will first look at Cuffdiff/CummeRbund and then try the alternative method DESeq.

Several tools, such as DESeq and edgeR, start from read counts per gene and use the discrete nature of the data to do statistical tests appropriate for this kind of data. It can be argued that that such counts will never give quite the correct results because the presence of alernative isoforms confound the read counting. Cuffdiff therefore combines isoform-level quantification and differential expression testing into one framework and claim that they achieve better results because they are able to take into account the uncertainty of isoform quantification.

This differential expression tutorial was originally written by Mikel Huss, and adapted and extended for this course by Pär Engström (both at SciLifeLab, DBB, Stockholm University).

Data set

In this exercise we are going to look at RNA-seq data from the A431 cell line. A431 is an epidermoid carcinoma cell line which is often used to study cancer and the cell cycle, and as a sort of positive control of epidermal growth factor receptor (EGFR) expression. A431 cells express very high levels of EGFR, in contrast to normal human fibroblasts. In the experiment we are looking at today, A431 cells were treated with gefinitib, which is an EGFR inhibitor, and is used (under the trade name Iressa) as a drug to treat cancers with mutated and overactive EGFR. In the experiment, RNA was extracted at four time points: before the gefinitib treatment (t=0), and two, six and twenty-four hours after treatment (t=2, t=6, t=24, respectively), and sequenced using an Illumina HiSeq instrument in triplicates (thus there are 3x4=12 samples). Since gefinitib suppresses the expression of EGFR, the effects on the cell line are quite severe, as you will see when comparing the 24h time point to the control.

Some of the files that we will use have been pre-constructed by us ahead of the lab because it would take too much time to do it during the lab exercise. In those cases, we have indicated the commands that we used to generate the files, so that you should be able to reproduce the results on your own.

CuffDiff and CummeRbund

We will start by looking at Cuffdiff and CummeRbund. These tools are part of the “Tuxedo” suite, which also includes Bowtie, TopHat and Cufflinks.

Cuffdiff, which you have already tried in an earlier exercise, is a command-line program that does the actual differential expression testing, and cummeRbund is an R package that reads Cuffdiff output and visualizes it in various nice ways.

Running Cuffdiff on the A431 RNA-seq data set would take too long for the purposes of this exercise and we have therefore executed it in advance so that you can access the results. For reference, these were the commands used:

# Just for reference; do not run this
ALI=/path/to/TopHat/alignments
cuffdiff -o cuffdiff_all --labels 0h,2h,6h,24h -p 8 \
  Homo_sapiens.GRCh37.71.gtf \
  $ALI/sample_1/accepted_hits.bam,$ALI/sample_2/accepted_hits.bam,$ALI/sample_3/accepted_hits.bam \
  $ALI/sample_4/accepted_hits.bam,$ALI/sample_5/accepted_hits.bam,$ALI/sample_6/accepted_hits.bam \
  $ALI/sample_7/accepted_hits.bam,$ALI/sample_8/accepted_hits.bam,$ALI/sample_9/accepted_hits.bam \
  $ALI/sample_10/accepted_hits.bam,$ALI/sample_11/accepted_hits.bam,$ALI/sample_12/accepted_hits.bam

You can find an archive with CuffDiff output for a pairwise differential expression analysis between all the time points here. Download and unpack this archive. This should give you a direcory called cuffdiff_all, containing the output files.

As mentioned above, CummeRbund is implemented as an R package. Start R and use the library() function to load the cummerRbund package:

library(cummeRbund)

If this does not work, CummeRbund might not be installed on your system. You can install and load it as follows:

source("http://bioconductor.org/biocLite.R")
biocLite("cummeRbund")
library(cummeRbund)

If you were running R on your local computer, you could bring up a tutorial with the command vignette("cummeRbund-manual"). However, this does not work over the remote connection we are using for this course. Go to the CummeRbund web page to access the manual.

As you can see in the manual, CummeRbund provides lots of ways to visualize Cuffdiff output. We will just look at some of the possible visualizations, but it should be easy for you to try other ones from the manual after you have gone through the examples here.

Start by loading the Cuffdiff results. The command below should work if your R working directory contains the folder with Cuffdiff output (cuffdiff_all); otherwise you will have to specify the path to this folder.

cuff <- readCufflinks("cuffdiff_all")

This will take a while the first time you run it. For subsequent runs on the same data, it will be faster because cummeRbund will have stored an SQL database containing the information it needs to disk.

As mentioned above, the cuffdiff_all directory contains information about differential expression for genes and isoforms between all pairs of time points. We could start by looking at an overview of how many genes that are differentially expressed at a false discovery rate (FDR) of 1%:

mySigMat <- sigMatrix(cuff, level="genes", alpha=0.01) # constructs the plot
mySigMat # displays the plot

It would work to just give the sigMatrix(cuff, level="genes", alpha=0.01) command but it might be nice to keep the plot as an R object as done above.

The resulting plot shows how many genes that are differentially expressed between each pair of time points. It is easy to show numbers corresponding to isoforms instead; this is generally the case for all cummeRbund functions:

sigMatrix(cuff, level="isoforms", alpha=0.01)

One way to see which genes that are differentially expressed in different comparisons is the following:

mySigTable <- getSigTable(cuff,alpha=0.01,level="genes")

So for the 24h vs control comparison, you could select the X0hvsX24h (R doesn’t want variable names to start with numbers so it adds an X in those cases) column from mySigTable and filter to those genes that have a 1 in that column:

diff_0_24 <- mySigTable[,"X0hvsX24h"]
cuffgenes.24h.ctrl <- names(diff_0_24[which(diff_0_24==1)])

Now you have a list of differentially expressed genes between 24h and control. This list could be saved to file using e g:

write.table(cuffgenes.24h.ctrl, file="cuffgenes.24h.ctrl.txt", quote=F, row.names=F, col.names=F)

If you would just like to plot the expression of one specific gene across the time points, you could do something like the following. Let’s use the EGFR gene since we know that it should be affected by the treatment in the experiment:

myGene <- getGene(cuff, "EGFR")
expressionPlot(myGene) # Will collapse replicates and only show gene level FPKM
expressionPlot(myGene, replicates=T) # Will show replicate FPKMs
expressionBarplot(myGene, replicates=T) # Show as bar plot instead

We can also plot FPKMs for all isoforms:

expressionPlot(isoforms(myGene), replicates=T)
expressionBarplot(isoforms(myGene), replicates=T) # Might be quite cluttered

Perhaps we are interested in genes that behave in the same way as our gene of interest. In that case we could use:

egfr.similar <- findSimilar(cuff, "EGFR", n=20) # Will find the 20 genes that are most similar to EGFR
egfr.similar.expr <- expressionPlot(egfr.similar,logMode=T,showErrorbars=F)
egfr.similar.expr

If we are not interested in a particular pattern but would like to find out which patterns seem to be present in the data, we can use cummeRbund’s built-in clustering method. Let’s say we assume that there are 10 different temporal patterns (the choice is up to you of course) and restrict the clustering to genes that are considered to be differentially expressed between at least one pair of time points. This will probably take a while to run:

sig.gene.ids <- getSig(cuff,alpha=0.01,level='genes')
sig.genes <- getGenes(cuff, sig.gene.ids)
cl <- csCluster(sig.genes,k=10)
clplot <- csClusterPlot(cl)
clplot

Finally, we can look at some plots that visualize how similar the samples are to each other. In a real project we would probably have started by doing this at the beginning to check if the data look OK.:

MDSplot(genes(cuff),replicates=T)

The MDS (multidimensional scaling) plot attempts to visualize the high-dimensional data (tens of thousands of genes) in a two-dimensional plot so that the distances between each sample are preserved as faithfully as possible.

There is also a heatmap function for samples. Of course there are also various heatmaps for gene lists; you can read more about that in the manual.

csDistHeat(genes(cuff))

DESeq

As mentioned above, the DESeq approach identifies differentially expressed genes based on counts of the number of reads mapped to each gene. DESeq is not limited to RNA-seq, but can be used for comparions of other count-based data, such gene expression profiles from tag sequencing or data from ChIP-seq experiments.

The DESeq method is implemented in the R packages DESeq and DESeq2. The latter is more recent, and recommended. The DESeq2 package is also available in several versions, tied to different versions of R (this applies to all Bioconductor packages). To use the most recent version of DESeq2, make sure you have the most recent R version installed. Also note that DESeq2 strictly requires R version 3.0 or above.

For this exercise we have pre-calculated read counts per gene (according to Ensembl annotations) with commands like:

# Only given for reference, not supposed to be executed during the lab
samtools view accepted_hits_137_1.bam | sort > accepted_hits_prehtseq_137_1.sam
htseq-count -s no -q accepted_hits_prehtseq_137_1.sam Homo_sapiens.GRCh37.71.gtf > 137_1.counts

and combined the counts into a single table. You will import this table into R and use DESeq2 to get a list of differentially expressed genes. Download the table here.

We will just look at the 24h vs control comparison here rather than doing anything more complex.

Start R and load the DESeq2 package:

library(DESeq2)

If this does not work, you may need to go through the usual drill to install the package:

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

The actual analysis is rather simple, after you have set up the data you are going to feed to DESeq2. Start by reading the count_table.txt file. Of course you need to be in the same directory as the file for the following command to run cleanly:

counts <- read.delim("count_table.txt")

You may want to look at the table with commands like head(counts). Next, you need to create a table with information about the samples:

samples <- data.frame(timepoint = rep(c("ctrl", "t2h", "t6h", "t24h"), each=3))

Look at the content of the data frame that you created. (Type the name of the object, samples, and press enter.) Note that the table only has one column, which indicates the time point for each of the 12 samples. For this simple experimental design, this is all we need: the timepoints define the groups that we wish to to compare. It doesn’t really matter what you call the groups, as long as the names are distinct. For example, we could have used “t0h” instead of “ctrl” for the first time point.

If instead of coming from a cell line, these samples were (say) tumor samples from different patients, such that for example samples 0h_1, 2h_1, 6h_1 and 24h_1 were all from the same person at different time points, the sample description could be extended by one column:

# Not to be used in the lab - just an example!
samples <- data.frame(timepoint = rep(c("ctrl", "t2h", "t6h", "t24h"), each=3), individual=rep(1:3, 4))

This would facilitate a so-called factorial design and specifying it for DESeq2 would potentially give more statistical power than just comparing groups “blindly”. However, we are not going to do this here. Now it’s time to construct the data set object that DESeq2 needs to perform the analysis:

ds <- DESeqDataSetFromMatrix(countData=counts, colData=samples, design=~timepoint)

This function call constructs a DESeq2 data set object using the arguments we provide: (1) count table; (2) sample description, and (3) experimental design. All three arguments are mandatory. The design is specified as a formula (another type of R object). In this case the formula is easy: the timepoints are really the only thing we can compare to each other. If we had had an additional factor as described above, we could have chosen to test for differences between timepoints while correcting for variability arising from individual differences, or to test for differences between individuals, while correcting for variation arising from the timepoints. For example:

# Just an example
ds <- DESeqDataSetFromMatrix(countData=counts, colData=expr.desc, design=~timepoint + individual) # to test for differences between individuals
ds <- DESeqDataSetFromMatrix(countData=counts, colData=expr.desc, design=~individual + timepoint) # to test for differences between timepoints

Now that we are set, we can proceed with the differential expression testing:

ds <- DESeq(ds)

This very simple function call does all the hard work. Briefly, this function performs three things:

  • Compute a scaling factor for each sample to account for differences in read depth and complexity between samples
  • Estimate the variance among samples
  • Test for differences in expression among groups (time points, in our case)

For more details, see the manual page for the function:

? DESeq

You can also have a look at the vignette for the DESeq2 package, which can be found on the DESeq2 web page. If you were running R locally, you would also be able to bring up the vignette with the command vignette("DESeq2").

Now we just need to extract the results, but how do we see the specific comparison we want, e.g. 24h vs control? We can use the following function to see which comparisons that have been made by DESeq2:

resultsNames(ds)

You should see names of different comparisons here. Choose one of them and give a command like:

res <- results(ds, "timepoint_t24h_vs_ctrl")

The resuling object is a table with test results for each gene in the original count table, i.e. all annotated genes, both protein-coding and non-coding. Use the function head() to inspect the results table.

Do you understand what the columns mean? You can see information such as the “base mean” (an average of the normalized mean counts per group), the log2 fold change between the groups, and the P-values (both “raw” and adjusted for multiple comparisons). If you are unsure how to interpret these data, discuss with one of the instructors.

We are not interested in results for all genes. For example, genes with zero or very low counts across all samples cannot be tested for differences in expression. Those will have a P-value of NA (not applicable). There are about 32,000 such genes, which we remove:

nrow(res)
sum( is.na(res$pvalue) )
res <- res[ ! is.na(res$pvalue), ]
nrow(res)

You probably want to focus on genes that are significant according to some criterion, such as false discovery rate (FDR) or log fold change. Filtering on the adjusted P-value (column padj) is equivalent to choosing a desired false discovery rate. For example, we can filter the results such that 1% are expected to be false positives (genes with no actual difference in expression):

sig <- res[ which(res$padj < 0.01), ]

(Note: in some versions of DESeq, you may need to use res$FDR instead of res$padj.)

How many significantly differentially expressed genes do you get? (Hint: try the dim() and nrow() functions).

To see the top “hits”, we can sort the filtered result list by statistical significance:

sig <- sig[ order(sig$padj), ]

We convert the table to a data frame, so that we can manipulate and view it more easily:

sig <- as.data.frame(sig)
head(sig, n=20)

If the table wraps over several lines, you can try to change some R options, before viewing the table:

options(width=120)  ## Display width (number of characters)
options(digits=5)   ## Number of digits to show for numbers
head(sig, n=20)

You might want to compare the results from CuffDiff and DESeq2. The names of the significant genes from DESeq2 can be easily obtained by:

deseqgenes.24h.ctrl <- rownames(sig)

If you still have the list of significant genes between 0h and 24h from the CuffDiff/cummeRbund analysis in your R session, or if you have saved it to file, you might want to check how many of them that were picked up by both programs:

common.24h.ctrl <- intersect(deseqgenes.24h.ctrl, cuffgenes.24h.ctrl)

The DESeq2 package contains a function plotMA() that can be used to visualize the differences in gene expression:

plotMA(ds)

Do you understand what this plot shows? Look at the manual page for the function, and run it again with the argument pvalCutoff set to different values. Discuss with an instructor if you are unsure how to interpret the plot.

If time allows, try some of the additional functions described in the DESeq2 vignette. For example, you may interested in the chapter on Data quality assessment, which describes ways to visualize results using heatmaps.

You may also want to try some of the examples from the cummeRbund manual.

Further reading

The algorithms used by Cuffdiff and DESeq are described in the papers by Trapnell et al. (2013) and Anders and Huber (2010) , respectively.

Both groups have also published descriptions of how to use their tools in Nature Protocols: Trapnell et al. (2012), Anders et al. (2013)

For a recent review and evaluation of a range of methods for normalization and differential expression analysis of RNA-seq data, see Rapaport et al. (2013).

Table Of Contents

Previous topic

RNA-seq tutorial

Next topic

Very quick introduction to R

This Page