A common question we're asked is "how many reads should I use to sequence a sample?" I'm going to focus on genomes, exomes and amplicomes in this post and introduce the Lander-Waterman equation. Other applications are more complex: for RNA-seq, ChIP-seq and other counting applications the answer is very much 'how long is a piece of string'. It depends on the complexity of your sample and the sensitivity you'd like to achieve, and it is also affected by the number of replicates you have.
Figure: The Lander-Waterman equation
Lander-Waterman: Almost everyone doing NGS is using this equation, even if they are not aware of it. Anyone under 27 was born after it was published (1988), but it is a good equation to understand if you are sequencing. Basically, it allows you to estimate how many reads of a specific length you need to sequence your genome to a given coverage.
The general equation is C = LN/G, where C is the redundancy of coverage, G is the haploid genome size, L is the sequence read length, and N is the number of sequence reads. It can be rearranged to N = CG/L, allowing you to compute the number of reads needed to sequence a genome, exome or amplicome (amplicon panel) to a desired coverage. This is what we typically discuss when designing experiments.
In the examples below, paired-end reads of 125bp from each end of a fragment are used, but each pair is treated as a single 250bp read for simplicity.
- Human genome (3Gb) 30x coverage = 360M reads.
- Human exome (150Mb) 50x coverage = 30M reads.
- Human amplicome (30 x 250bp amplicons, 7.5kb) 1000x coverage = 0.03M reads (30,000).
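The arithmetic behind these examples can be sketched in a few lines of Python. This is only an illustration of N = CG/L: the function name `reads_needed` and the `examples` dictionary are hypothetical, and the amplicon panel size is taken as 30 amplicons of 250bp each (7,500bp in total). The equation is an idealised estimate, so real experiments usually need some headroom on top of these numbers.

```python
import math


def reads_needed(coverage, target_size_bp, read_length_bp):
    """Rearranged Lander-Waterman equation: N = C * G / L.

    The result is rounded up, because you cannot sequence a fraction of a read.
    """
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(coverage * target_size_bp / read_length_bp)


# 2 x 125bp paired-end reads, treated as one 250bp read for simplicity.
READ_LENGTH_BP = 250

# (desired coverage, target size in bp); sizes are the ones used in this post.
examples = {
    "Human genome (3Gb, 30x)": (30, 3_000_000_000),
    "Human exome (150Mb, 50x)": (50, 150_000_000),
    "Amplicon panel (30 x 250bp, 1000x)": (1000, 30 * 250),
}

for name, (coverage, size_bp) in examples.items():
    n_reads = reads_needed(coverage, size_bp, READ_LENGTH_BP)
    print(f"{name}: {n_reads:,} reads")
```

Swapping in your own target size, coverage and read length gives a quick first estimate to bring to an experimental design discussion.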
Lander, E. S. & Waterman, M. S. Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis. Genomics 2, 231–239 (1988).
Eric Lander founded both the Whitehead and Broad Institutes. Michael S. Waterman is one of the founders of computational biology and gave his name to another important algorithm: Smith-Waterman alignment. He also co-wrote Computational Genome Analysis with our Director Simon Tavaré while at the University of Southern California.