The Genomics Core blog

Thursday 2 February 2017

Understanding your @10Xgenomics cell ranger reports

The Cell Ranger analysis provided by 10X is an excellent start to understanding what might be going on in the single cells you just sequenced. It allows some basic QC and this can help determine how well your experiment is working. There is a high degree of variability in the number of cells captured and capture efficiency, but right now we cannot easily see if this is down to the sample (most likely) or the technology.

Some of the metrics are easy to interpret e.g. the ‘Estimated Number of Cells’ (how many single cells were captured) – the more the merrier! Others need to be compared across runs to determine what the “correct” parameters for an experiment might be e.g. the current 10X recommendation for ‘Mean Reads per Cell’ is 50,000, but you may find that more, or fewer, reads are required for your samples. You can use the other metrics such as ‘Median Genes per Cell’ or ‘Sequencing Saturation’ to help determine when more or less sequencing depth is required.

The most important metrics: 10X help by making the most important stuff big. You should already have an idea of the number of cells you expected to capture (because you carefully counted your cells before starting, didn't you?); hopefully the ‘Estimated Number of Cells’ matches what you were aiming for. Ideally this would be the same across your project, but it is likely to be quite variable if the cell types are very different.

The ‘Mean Reads per Cell’ and ‘Sequencing Saturation’ both tell you whether you've over-sequenced. Our recommendation is to run a single lane on HiSeq 4000 first and to use these numbers to determine if more sequencing is worth it or not. Diving in for a lane per sample might turn out to be an expensive mistake (as it was in the example above).

The ‘Median Genes per Cell’ is likely to become a key metric for users. We've become used to detecting 10,000-15,000 genes in microarray and RNA-Seq experiments on bulk tissue. What the figure is for single-cell remains to be seen. However, it is likely to be quite cell specific, and is also likely to increase as methods capture more of the transcripts.

The ‘Sequencing’ table metrics explained:

‘Number of Reads’ equals the total number of single-end reads that were sequenced.
‘Valid Barcodes’ equals the fraction of reads with barcodes that match the whitelist.
‘Reads Mapped Confidently to Transcriptome’ equals the fraction of reads that mapped to a unique gene in the transcriptome with a high mapping quality score as reported by the aligner.
‘Reads Mapped Confidently to Exonic/Intronic/Intergenic Regions’ equals the fraction of reads that mapped to the exonic/intronic/intergenic regions of the genome with a high mapping quality score as reported by the aligner.
‘Sequencing Saturation’ equals the fraction of reads originating from an already-observed UMI. This is a function of library complexity and sequencing depth. More specifically, this is the fraction of confidently mapped, valid cell-barcode, valid UMI reads that had a non-unique (cell-barcode, UMI, gene). This metric was called "cDNA PCR Duplication" in versions of Cell Ranger prior to 1.2.
‘Q30 Bases in Barcode/Sample Index/UMI Read’ equals the fraction of bases with Q-score at least 30 in the cell barcode/sample index/unique molecular identifier sequences.
‘Q30 Bases in RNA Read’ equals the fraction of bases with Q-score at least 30 in the RNA read sequences. This is Illumina R1 for the Single Cell 3' v1 chemistry and Illumina R2 for the Single Cell 3' v2 chemistry.
‘Estimated Number of Cells’ equals the total number of barcodes associated with cell-containing partitions, estimated from the barcode count distribution.
‘Fraction Reads in Cells’ equals the fraction of barcoded, confidently mapped reads with cell-associated barcodes.
‘Mean Reads per Cell’ equals the total number of sequenced reads divided by the number of barcodes associated with cell-containing partitions.
‘Median Genes per Cell’ equals the median number of genes detected per cell-associated barcode. Detection is defined as the presence of at least 1 UMI count.
‘Total Genes Detected’ equals the number of genes with at least one count in any cell.
‘Median UMI Counts per Cell’ equals the median number of UMI counts per cell-associated barcode.
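
To make a few of these definitions concrete, here is a minimal Python sketch that computes Sequencing Saturation, Mean Reads per Cell, Median Genes per Cell and Median UMI Counts per Cell from a handful of made-up reads. It is not how Cell Ranger is implemented: the function name summarise_run and the toy barcodes, UMIs and gene names are assumptions for illustration, and saturation is taken to mean every read after the first for a given (cell-barcode, UMI, gene) combination.

```python
from collections import Counter, defaultdict
from statistics import median


def summarise_run(reads, cell_barcodes, total_reads):
    """Compute a few report-style metrics from toy data (illustrative only).

    reads         -- (cell_barcode, umi, gene) tuples, one per confidently
                     mapped read with a valid barcode and a valid UMI
    cell_barcodes -- barcodes judged to come from cell-containing partitions
    total_reads   -- every read sequenced, mapped or not
    """
    reads = list(reads)
    molecules = Counter(reads)  # reads per unique (barcode, UMI, gene)

    # Every read beyond the first for a molecule re-observes a known UMI.
    saturation = (len(reads) - len(molecules)) / len(reads)

    genes = defaultdict(set)  # genes with at least one UMI, per cell
    umis = Counter()          # unique molecules (UMI counts), per cell
    for barcode, _umi, gene in molecules:
        if barcode in cell_barcodes:
            genes[barcode].add(gene)
            umis[barcode] += 1

    return {
        "sequencing_saturation": saturation,
        "mean_reads_per_cell": total_reads / len(cell_barcodes),
        "median_genes_per_cell": median(len(genes[b]) for b in cell_barcodes),
        "median_umi_counts_per_cell": median(umis[b] for b in cell_barcodes),
    }


# Hypothetical toy data: two cells, eight mapped reads, ten reads sequenced.
toy_reads = [
    ("AAAC", "U1", "GeneA"),
    ("AAAC", "U1", "GeneA"),  # repeat of an already-observed molecule
    ("AAAC", "U2", "GeneB"),
    ("TTTG", "U1", "GeneA"),
    ("TTTG", "U3", "GeneA"),
    ("TTTG", "U3", "GeneA"),  # repeat of an already-observed molecule
    ("TTTG", "U4", "GeneC"),
    ("TTTG", "U5", "GeneD"),
]
toy_metrics = summarise_run(toy_reads, {"AAAC", "TTTG"}, total_reads=10)
print(toy_metrics)
```

With these toy reads, two of the eight mapped reads repeat an already-seen molecule, so the saturation is 0.25. In a real run, a high value suggests that extra sequencing will mostly re-read molecules you have already counted.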

It will help to look at these numbers over time and across projects. Right now the data about the sample is limited, but collecting more sample/experiment metadata is likely to help determine whether an experiment has worked or not. For now it is difficult for us to give advice, as your experiment may be the first time we've ever run that type of cell!

Monday 24 October 2016

How do I submit my index information into Lablink?

When accepting sequencing submissions in the Genomics Core, there may be instances where we have to contact you if there is an error with your submission form. The most common problems relate to index information. We have put together some instructions here that we hope should make things easier and help us to get started on your sequencing as soon as we can.

Please only follow these instructions if:

· The index sequences you have used are visible in the index sequences tab of the submission form

· There are fewer than 384 samples within your pool.

If the points above are not true, please see section 3, Unspecified Index, further on in this post.

1. Completing the sample/reagent label field

1a. Navigate to the index sequences tab of the sample submission form.

1b. Search for your index sequences

1c. Copy the index name from column C of the index sequences tab (e.g. A001-A005) to the Sample/Reagent Label column of the submission form tab.

Figure 1 - Index sequences tab of the sample submission form

Figure 2 - Submission form

2. Completing the UDF/Index type field

2a. Select the correct UDF/Index type from the drop down menu on the submission form tab.

IMPORTANT: please make sure the Index type field matches column B of the index sequences tab. This ensures that your library goes through our acceptance step. Please see the two following examples.

Example 1 - I am submitting a Truseq LT library consisting of 5 samples and used indexes A001-A005.

The sample/reagent label on the submission form should read A001-A005.
The UDF/index type should read Truseq LT.

Figure 3 - The index type field next to these indexes is Truseq LT so this is what should be entered into the UDF/Index type field.

Figure 4 - Submission form

In most cases, the Index type will match the indexes you have used, as expected. However, there are now multiple kits available which share the same indexes.
Because of this, there may be some cases where the index sequences you select have a different Index type to the library you have made (see example 2 below). This may affect you if you are submitting Nextera XT or Nextera libraries.

Example 2 - I used indexes N701-N501, N702-N501, N703-N501, N704-N501. I prepared the libraries using a Nextera XT library prep kit.

The sample/reagent label on the submission form should read N701-N501, N702-N501, N703-N501, N704-N501. The UDF/index type should read Nextera and not Nextera XT. This is because the UDF/Index type needs to match column B on the index sequences tab.

Figure 5 - The index type field next to these indexes is Nextera so this is what should be entered into the UDF/Index type field
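
The rule in both examples (the UDF/Index type must match column B of the index sequences tab, whatever kit you used) can be sketched as a simple lookup. This is only an illustration in Python, not part of the submission form or the LIMS: the INDEX_TAB dictionary is a hypothetical excerpt containing just the indexes from the two examples, and the function names are made up.

```python
# Hypothetical excerpt of the index sequences tab:
# index name (column C) -> index type (column B).
INDEX_TAB = {
    "A001": "Truseq LT",
    "A002": "Truseq LT",
    "A003": "Truseq LT",
    "A004": "Truseq LT",
    "A005": "Truseq LT",
    "N701-N501": "Nextera",
    "N702-N501": "Nextera",
    "N703-N501": "Nextera",
    "N704-N501": "Nextera",
}


def expected_index_type(labels):
    """Return the index type that column B gives for these index names."""
    types = {INDEX_TAB[label] for label in labels}
    if len(types) != 1:
        raise ValueError(f"labels span several index types: {sorted(types)}")
    return next(iter(types))


def matches_index_tab(labels, entered_type):
    """True if the type entered on the form agrees with column B."""
    return expected_index_type(labels) == entered_type


example_1 = ["A001", "A002", "A003", "A004", "A005"]
example_2 = ["N701-N501", "N702-N501", "N703-N501", "N704-N501"]

print(expected_index_type(example_1))              # Truseq LT
print(matches_index_tab(example_2, "Nextera XT"))  # False: kit name, not column B
print(matches_index_tab(example_2, "Nextera"))     # True
```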

3. Unspecified Index

If your pool has index sequences not present in the index sequences tab OR if you have a pool which is made up of more than 384 samples, you will need to submit as unspecified index.

In the submission form:

· Sample/reagent label should read: unspecified

· UDF/Index type should read: Unspecified (other)

You should submit your pool as one row on the form. Libraries submitted as unspecified index cannot be demultiplexed by the Genomics Core but we do have a demultiplexing guide on lablink which should give some useful information.
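
As a rough sketch of the decision described in this section, the Python below checks whether a pool has to go in as unspecified index. The function name and the set of known index names are hypothetical. Note one assumption: the post says "fewer than 384" in one place and "more than 384" in another, so a pool of exactly 384 samples is treated here as still allowed; please ask us if you are right on that boundary.

```python
# Values to enter on the single row of the form for an unspecified pool.
UNSPECIFIED_ROW = {
    "Sample/Reagent Label": "unspecified",
    "UDF/Index type": "Unspecified (other)",
}


def needs_unspecified(index_names, index_tab_names, max_samples=384):
    """Decide whether a pool must be submitted as unspecified index.

    index_names     -- one index name per sample in the pool
    index_tab_names -- names listed in the index sequences tab
    max_samples     -- assumed limit; exactly 384 is treated as allowed
    """
    too_many = len(index_names) > max_samples
    unknown = any(name not in index_tab_names for name in index_names)
    return too_many or unknown


known = {"A001", "A002", "A003"}                  # hypothetical tab contents
print(needs_unspecified(["A001", "A002"], known))  # False
print(needs_unspecified(["A001", "MYIDX"], known)) # True: index not in the tab
print(needs_unspecified(["A001"] * 385, known))    # True: more than 384 samples
```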

Important: since we have no index sequence information, please write the index lengths for Index 1 and Index 2 in the comments section of the form. Without this information your sequencing may be delayed whilst we contact you to check these parameters.
Once you have submitted your libraries, the Genomics Core would like to start working on your sequencing as soon as we can.

If the incorrect index type has been selected, we will need to delete your submission and we would ask you to submit again after making changes to your sample sheet and following the instructions above. Of course whilst this guide should be used to help you, we are always here to discuss this with you in person if you have any questions. Alternatively you can contact us on our helpdesk: genomics-helpdesk@cruk.cam.ac.uk

Sunday 9 October 2016

Recent papers that the Genomics Core has helped with

I like to highlight some of the really interesting work we've been involved with, or that has come out of the Institute, from time to time, and I recently updated our lab home page with links to a couple of papers. I thought I'd take the opportunity to write about them in a bit more detail here. Many of you will already know I run the Genomics Core facility at CRUK's Cambridge Institute. We do a lot of Illumina sequencing! The lab works on a huge number of projects for the research groups here in the Institute, and also across many groups in Cambridge via a long-running sequencing collaboration. We do some R&D work in my lab, but >90% of our efforts are working with, or for, other research groups.

The papers:

Pereira et al Nature Communications 2016: The somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes
Bruna et al Cell 2016: A Biobank of Breast Cancer Explants with Preserved Intra-tumor Heterogeneity to Screen Anticancer Compounds
Lensing et al Nature Methods 2016: DSBCapture: in situ capture and sequencing of DNA breaks
Murtaza et al Nature Communications 2015: Multifocal clonal evolution characterized using circulating tumour DNA in a case of metastatic breast cancer

Highlights from the last year's genomics research include work from the Caldas group, who have completed three projects over the last year that I've included here: 1) profiling of almost 2,500 breast cancer patients for mutational analysis of 173 genes using a targeted pull-down (Pereira et al. Nature Communications 2016); 2) cancer exomes from Murtaza et al.; 3) PDXs from Bruna et al.; and work from the Balasubramanian group, who have shown that it is possible to capture and sequence double-strand DNA breaks (DSBs) in situ and directly map these at single-nucleotide resolution, enabling the study of DSB origin (Lensing et al. Nature Methods 2016). The rapid speed and unbiased nature of the genome-wide experiments being performed in the Institute, and often prepped and sequenced in the Genomics Core, continue to increase our understanding of cancer biology.

Why is my HiSeq 2500 sequencing taking longer than usual?

With the introduction of the HiSeq 4000 we're able to sequence faster and cheaper than ever before. But as we're transitioning the larger projects over to HiSeq 4000 a side-effect is fewer and fewer samples to run on HiSeq 2500; and as we're waiting for samples to fill the 8 lane flowcell that means longer wait times for you. We thought this post might help you determine if you still need to use HiSeq 2500, or if you can migrate over to HiSeq 4000. Most sequencing is taking under 2 weeks, but some people are now waiting up to one month for 2500 data.

Running a big RNA-seq project is easy(ish)

Last year we completed our largest ever RNA-seq project: 528 samples of TruSeq mRNA, 60 lanes of HiSeq 2500 SE50, 13 billion reads - and all in 16 weeks. Being able to do such a large project in such a short time and get high quality data from nearly all samples really demonstrates the robustness of RNA-seq. If you're thinking that a project larger than 96 samples might be too much to consider, then come and talk to us (and Bioinformatics) at a Tuesday afternoon experimental design meeting - and we'll convince you it can be a pretty smooth process.

We've been using Illumina's TruSeq mRNA-seq automated on our Agilent Bravo robot and the sequencing was done on HiSeq 2500, although we're currently moving to HiSeq 4000.

528 samples processed on six plates of RNA-seq
QC lanes sequenced and analysed
60 lanes of SE50bp sequencing in total, 10 lanes per plate
12,918,018,345 PF reads for this project (215M reads per lane on average)
24M reads per sample on average
16 weeks from start to finish
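
For anyone planning something similar, the averages in the list above are simple divisions of the project totals. The short Python check below reproduces them; the variable names are just illustrative, and the inputs are the figures quoted above.

```python
pf_reads = 12_918_018_345  # PF reads for the whole project
lanes = 60                 # lanes of SE50 sequencing
samples = 528              # samples processed
plates = 6                 # plates of RNA-seq

reads_per_lane = pf_reads / lanes        # about 215 million
reads_per_sample = pf_reads / samples    # about 24 million
lanes_per_plate = lanes / plates         # 10 lanes per plate
samples_per_plate = samples / plates     # 88 samples per plate

print(f"{reads_per_lane / 1e6:.0f}M reads per lane")
print(f"{reads_per_sample / 1e6:.0f}M reads per sample")
print(f"{lanes_per_plate:.0f} lanes and {samples_per_plate:.0f} samples per plate")
```

Swapping in your own sample count and target reads per sample gives a quick first estimate of how many lanes to ask for.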

This has been a large and complex project where we had lots of discussions along the way. I think that everyone involved has contributed to the success so far: the research group who asked us to do the project, my lab, and also our Bioinformatics Core. The ability to discuss the experiment at different stages, and to focus on QC issues as they arise really makes using the Cores a great place to do your projects.

Sunday 7 February 2016

Our first paper on the bioRxiv

I just uploaded our paper, which has also been submitted to BioTechniques, onto the bioRxiv preprint server. The work we present comes from an idea I had shortly after first using Agilent's Bioanalyzer in 2000. I was blown away by this piece of technology that has become the de facto standard for RNA QC, and has also pretty much replaced gel electrophoresis for DNA fragment analysis in NGS applications. When launched in 1999, it was the only microfluidics instrument for biology applications. The idea was a simple one: can Bioanalyzer chips be swapped between assays?

Following us on Twitter

The Genomics Core now has two Twitter accounts: you can follow me @CIgenomics (James Hadfield, Head of Genomics) and hear about things I think are interesting, but which you might not necessarily be interested in; and/or you can follow our sequencing queue @CRUKgenomecore, which puts out live Tweets directly from the sequencing LIMS.

How does the LIMS Tweet: Some clever work by Rich in Bioinformatics has allowed us to pull out data directly from the Genologics Clarity LIMS queue using a script run every 24 hours, and the Twitter API then allows that script to post messages on our behalf. Because of this, the Tweets about our queue should happen every day and without manual intervention. Hopefully you'll be able to rely on these to give you a reasonable idea of how long you might have to wait for your sequencing results. Of course we can't predict what will happen with your particular sample, so please treat the Tweet as a guide.
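
The general shape of such a script (read the queue, build a short message, post it) might look something like the Python sketch below. This is not Rich's actual code: the message wording, the queue data and the function names are all hypothetical, and the LIMS query and the Twitter call are passed in as plain functions rather than using any real API.

```python
def build_queue_tweet(queue_counts, limit=140):
    """Turn {run type: lanes waiting} into one message within the Tweet limit.

    The wording here is made up for illustration; it is not the format
    our real Tweets use.
    """
    parts = [f"{run_type}: {n} lanes" for run_type, n in sorted(queue_counts.items())]
    text = "Sequencing queue today. " + "; ".join(parts)
    # Truncate rather than fail if the queue summary is too long.
    return text if len(text) <= limit else text[: limit - 3] + "..."


def tweet_queue(fetch_queue, post):
    """Run one daily update.

    fetch_queue -- stand-in for the LIMS query, returns {run type: lanes}
    post        -- stand-in for the Twitter API call, takes the message text
    """
    message = build_queue_tweet(fetch_queue())
    post(message)
    return message


# Dry run with made-up numbers; collect the message instead of posting it.
sent = []
message = tweet_queue(lambda: {"HiSeq 4000": 12, "HiSeq 2500": 3}, sent.append)
print(message)
```

Keeping the LIMS query and the posting step as separate functions means the message can be checked in a dry run like this one before anything goes out on a schedule.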

Tweets explained: The Tweets have a format that we hope is pretty intuitive, but we've described what all the bits of information mean below...

Thanks especially to Rich Bowers in the Bioinformatics core for pulling all of this together from a vaguely described idea by me.