Similarities in Biology: Rice Expression Database

Background

DNA can be considered the instruction book of a living organism. It holds all the information needed to create a new cell. This information is stored as a sequence of genes, where each gene describes a particular feature of the new cell. Gene expression is the process that produces, from a gene, the corresponding feature, also referred to as the phenotype. However, not all genes are always expressed. Sometimes the process stops before the corresponding feature is complete, and sometimes it does not start at all.

The expression level of a gene is measured in Fragments Per Kilobase of transcript per Million mapped reads (FPKM). FPKM normalizes the fragment count with respect to both the sequencing depth and the length of the gene itself (measured in kilobases).
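As an illustration of the normalization described above, here is a minimal sketch of the standard FPKM calculation. The function name, parameter names, and example numbers are ours, chosen only for illustration; they are not taken from RED, which already provides expression values in FPKM.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Return the FPKM of one gene in one sample.

    fragment_count: fragments mapped to the gene
    gene_length_bp: gene (transcript) length in base pairs
    total_mapped_fragments: all mapped fragments in the sample
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    length_kb = gene_length_bp / 1_000                    # per kilobase of transcript
    depth_millions = total_mapped_fragments / 1_000_000   # per million mapped fragments
    return fragment_count / (length_kb * depth_millions)


# Hypothetical example: 500 fragments on a 2,000 bp gene,
# in a sample with 20 million mapped fragments.
example_fpkm = fpkm(500, 2_000, 20_000_000)
print(example_fpkm)  # 12.5
```

Dividing by both the gene length and the sequencing depth is what allows values to be compared across genes of different sizes and across samples sequenced at different depths.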

Rice Expression Database (RED)

The Rice Expression Database (RED) is an open-source database that contains gene expression profiles derived entirely from RNA sequencing data. The data come from different growth stages of rice undergoing various biotic and abiotic treatments. For each gene, the expression level is given across a range of tissues sampled in 284 experiments.

Similarity Measure

Given the scope of the database, we decided to measure the similarity between two entries (genes) in terms of how closely their expression levels match across different tissues.

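The similarity was computed as the Manhattan distance between the expression levels of two genes across the entire range of tissues provided, that is, the sum over all experiments of the absolute difference between the two expression levels. The Manhattan distance suits our case because it provides a single index of how close the values (expression levels) are along each dimension (experiment), with every experiment contributing with equal weight. The smaller the distance, the more similar the two genes.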
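The computation can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the results: the function name and the three short expression profiles are made up for the example, whereas real RED profiles have one value per experiment.

```python
def manhattan_distance(profile_a, profile_b):
    """Sum of absolute differences between two expression profiles.

    Both profiles must list FPKM values for the same experiments,
    in the same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same experiments")
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))


# Hypothetical FPKM values across four experiments (not real RED data).
gene_x = [10.0, 0.5, 32.0, 7.25]
gene_y = [12.0, 1.5, 30.0, 7.25]   # follows the same pattern as gene_x
gene_z = [0.0, 45.0, 2.0, 60.25]   # follows a different pattern

print(manhattan_distance(gene_x, gene_y))  # 5.0
print(manhattan_distance(gene_x, gene_z))  # 137.5
```

Because the distance is a plain sum of FPKM differences, its magnitude grows with the number of experiments and with the overall expression level of the genes, so it is best read by comparison with other pairs rather than as an absolute score.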

Results

The two genes we identified as closely related are LOC_Os04g54790 and LOC_Os09g07660.

The computed Manhattan distance between these two genes was 3264.08 FPKM.

To make this number more meaningful, Figure 1 plots the expression levels of the two genes across a selection of tissues. Although there are differences, a common pattern is clearly visible.

Figure 1

To give a further sense of what the measure means, we compared gene LOC_Os04g54790 with a randomly chosen gene, LOC_Os03g18120.

The computed Manhattan distance was 13808.87 FPKM.

As expected, the value is much greater than before: more than four times the distance between the closely related pair.

As further evidence, Figure 2 plots the expression levels of these two genes. The difference is clearly visible.

Figure 2

Python script: Download

 

References

Rice Expression Database: http://expression.ic4r.org/index

LOC_Os04g54790: http://ic4r.org/genes/Os04g0640500

LOC_Os09g07660: http://ic4r.org/genes/Os09g0250700

LOC_Os03g18120: http://ic4r.org/genes/Os03g0291200