The advent of whole-genome sequencing has produced a number of startling discoveries about the human genome. For starters, we are hardly as complex as we would like to think: Homo sapiens lags behind many organisms in both genome size and the number of genes encoded. Take, for instance, the humble Arabidopsis thaliana, a flowering plant commonly used as a model organism. Despite its relatively small genome of 1.35×10⁸ base pairs (compared with approximately 3.3×10⁹ base pairs in humans), Arabidopsis contains around 6,000 more genes than humans do. The award for largest vertebrate genome goes to the marbled lungfish (Protopterus aethiopicus) with 1.3×10¹¹ base pairs, and even this is dwarfed by the amoeba Polychaos dubium and its staggering 6.7×10¹¹ base pairs.
Thus, it seems that the size of a genome, or even its gene count, says little about the overall complexity of the organism it encodes. Despite this, comparative analysis of multiple genomes has revealed much about the evolutionary linkage between species. By examining the similarities and differences within genes from multiple species, biologists have drawn links between evolutionary cousins. In addition, the rate at which sequence conservation is lost can be used to estimate when two species diverged. Scientists have also found a startling number of genes that emerged only recently. These “de novo genes” arise spontaneously in an organism and are therefore not found in any related species. This seems to contradict one of the long-standing hypotheses regarding the genesis of novel genes, namely that new genes are formed through a duplication event followed by mutations that result in a new protein product. How, then, can we explain the presence of genes that seemingly develop out of thin air within a species?
The answer can be found within the very framework of our DNA. Most organisms' genomes include regions of non-coding DNA, segments that do not encode protein sequences. In humans these regions account for up to 98% of the total DNA. Importantly, while non-coding regions do not encode proteins, they often contain functional elements: sequences that code for RNAs, form centromeres or telomeres, or help to control replication. While these sequences are typically well regulated, something interesting happens when the DNA suffers a mutation that produces a transcription start site. In these rare situations a gene is born: the DNA is transcribed to RNA and a new protein is produced.
In most cases the de novo protein is likely to be inactive, but it is possible for these proteins to become functional and, occasionally, useful. Conversely, detrimental or inactive genes are more likely to be lost. It is estimated that humans might carry 40 or more de novo genes, although the function of many of these genes remains to be explored.
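For readers who like to tinker, the idea of a gene appearing in previously non-coding DNA can be sketched with a toy Python example. This is a deliberately simplified illustration, not a model of real gene birth: the sequence is made up, the function names (find_orfs, point_mutation) are hypothetical, and a start codon (ATG) stands in for the start signal, whereas a real de novo gene also needs the region to be transcribed in the first place. The sketch shows how a single-letter change can turn a stretch with no readable protein message into one that has an open reading frame, meaning a start codon followed by an in-frame stop codon.

```python
# Toy illustration only: the sequence and the mutation below are invented.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) slices of open reading frames on the forward strand.

    An ORF here is an ATG followed by in-frame codons up to and including
    the first stop codon. min_codons counts the start and stop codons too.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos + 3 - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs


def point_mutation(dna, index, base):
    """Return a copy of dna with the base at index replaced."""
    return dna[:index] + base + dna[index + 1:]


# A made-up "non-coding" stretch that contains no ATG anywhere.
noncoding = "GGACGGCTAAAGGCTTTTGACC"

# A single C -> T change at index 3 turns ACG into ATG.
mutated = point_mutation(noncoding, 3, "T")

print(find_orfs(noncoding))  # no start codon, so no ORF is found
for start, end in find_orfs(mutated):
    print(start, end, mutated[start:end])
```

In this toy sequence, one substitution is enough to create a short reading frame where none existed before. Whether such a frame would ever yield a useful protein is a separate question, which is why most newborn genes are expected to be lost.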
For more on this topic stay tuned for part two of our feature on de novo genes!