Amanda Clare: December 2015

Wednesday 23 December 2015

Seamless gene deletion

2015 is the year that genome editing really became big news. A new technique, "CRISPR/CAS", was named as Science magazine's breakthrough of the year as voted by the public from a shortlist chosen by staff.

However, people were manipulating DNA through many useful methods long before CRISPR/CAS made headlines. Gene deletion is an important tool when trying to understand the function of genes: take out a gene and see what effect it causes. Genes can be disrupted (by removing a portion of the DNA or inserting some extra DNA), they can be interfered with (for example via their RNA production), or they can be entirely deleted. It's common practice when removing a gene to insert a marker, so that we can easily select for the cells where this procedure has been successful. For example, we can insert an antibiotic resistance gene as a marker and then grow the cells on a plate with the antibiotic. Only those cells that have lost our gene of interest and gained antibiotic resistance will grow. The trouble with this is that many gene deletions have no visible effect by themselves, so we often want to delete several genes in the same cell. If we want to delete a second gene and a third, then we need more markers, or we need to be able to remove and reuse the marker we inserted. We also don't want the process to leave any scars behind that could destabilise the genome. We've just published a paper to help solve this problem.

This process of 'swap a gene of interest for a marker gene' can be achieved in many organisms by homologous recombination, a process used by many cells to repair broken strands of DNA. Suppose we provide a piece of DNA that has a good region of similarity to the region just upstream of our gene of interest and a good region of similarity to the region just downstream, but carries the marker gene between these regions instead of the gene of interest. The normal cellular processes of homologous recombination will then exchange the two. Some organisms perform homologous recombination very readily (S. cerevisiae, for example). Others may need a little more encouragement, such as creating a double-strand break.
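To make the swap concrete, here is a toy sketch in Python that treats DNA as a plain string. It is only an illustration of the idea, not a model of the real biochemistry: the sequences are made up, real homology arms are far longer, and the function name homologous_swap is my own invention for this example.

```python
def homologous_swap(genome: str, upstream: str, downstream: str, payload: str) -> str:
    """Replace whatever lies between the two homology arms with `payload`.

    Toy model only: an exact string match stands in for sequence homology.
    """
    start = genome.find(upstream)
    if start == -1:
        raise ValueError("upstream homology arm not found in genome")
    inner_start = start + len(upstream)
    inner_end = genome.find(downstream, inner_start)
    if inner_end == -1:
        raise ValueError("downstream homology arm not found after upstream arm")
    # Keep both arms, swap only the stretch between them.
    return genome[:inner_start] + payload + genome[inner_end:]


# Made-up sequences, chosen so that each piece occurs exactly once.
LEFT, RIGHT = "GCGTGT", "CGCGCG"   # rest of the chromosome
UP, DOWN = "AAACCC", "TTTGGG"      # regions flanking the gene of interest
GENE = "ATGGGGTAA"                 # gene of interest
MARKER = "ATGCATCATTAA"            # marker gene

genome = LEFT + UP + GENE + DOWN + RIGHT
edited = homologous_swap(genome, UP, DOWN, MARKER)
print(edited)
```

The flanking regions are untouched and only the stretch between them changes, which is why the choice of those two regions of similarity decides exactly what gets replaced.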

Our new paper, "A tool for Multiple Targeted Genome Deletions that Is Precise, Scar-Free and Suitable for Automation", with Wayne Aubrey as first author, uses a 3-stage PCR process to synthesise a stretch of DNA (a 'cassette') that will do everything. It will have good regions of similarity to the regions upstream and downstream of the gene of interest. It will contain a marker gene. And (here's the good bit) it will contain a specially designed region ('R') before the marker gene that is identical to the region that occurs just after the gene of interest. In this way, after homologous recombination has done its thing and inserted the DNA cassette instead of the gene of interest, there will be two identical R regions, one before the marker gene and one after it. Sometimes the DNA will loop round on itself, the two R regions will match up, and homologous recombination will snip out the loop, including the marker gene.
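The two steps (cassette goes in, loop comes out) can be sketched with strings standing in for DNA. This is a simplified picture under my own assumptions rather than the construct from the paper: the sequences are invented, the R region doubles as the downstream region of similarity, and the function names integrate and pop_out are made up for the example.

```python
def integrate(genome: str, up: str, gene: str, r: str, marker: str) -> str:
    """Step 1: the cassette (up + R + marker + R) replaces up + gene + R."""
    target = up + gene + r
    if genome.count(target) != 1:
        raise ValueError("expected exactly one copy of the target locus")
    return genome.replace(target, up + r + marker + r)


def pop_out(genome: str, r: str, marker: str) -> str:
    """Step 2: the two R copies pair up and the loop between them is lost.

    One R copy and everything up to the second copy are removed, so a
    single R remains exactly where it sat in the original genome.
    """
    looped = r + marker + r
    if looped not in genome:
        raise ValueError("no R-marker-R arrangement found")
    return genome.replace(looped, r, 1)


# Made-up sequences, chosen so that each piece occurs exactly once.
LEFT, RIGHT = "GCGTGT", "CGCGCG"   # rest of the chromosome
UP = "AAACCC"                      # region just before the gene
GENE = "ATGGGGTAA"                 # gene of interest
R = "TTTGGG"                       # region just after the gene
MARKER = "ATGCATCATTAA"            # selectable marker

genome = LEFT + UP + GENE + R + RIGHT
with_marker = integrate(genome, UP, GENE, R, MARKER)
clean = pop_out(with_marker, R, MARKER)
print(with_marker)
print(clean)
```

In this toy version the final sequence is the original genome with the gene removed and nothing else changed, which is what 'scar-free' means here.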

We can encourage this to happen, and select for the cells in which it has happened, if our marker is also 'counter-selectable'. That is, we'd like a marker or marker combination for which we can first select for its presence and then, by adding something to the growth medium so that only cells without the marker will grow, counter-select for its absence. With this we can select for cells in which the marker has replaced the gene, and then counter-select for cells that have since lost the marker too. The result is a clean gene deletion.

Of course we're always standing on the shoulders of giants when we do science. Our method is an improvement on a method by Akada et al. (2006): no extra bases are lost or gained, and the method requires no gel purification steps. Just throw in your primers and products and away you go. It's not fussy about quantity. No purification steps means that it could be automated on lab robots. And it could be used to delete any genetic component, not just genes. Give it a try!

Thursday 17 December 2015

Data science and a scoping workshop for the Turing Institute

In November I went to a workshop to discuss the remit of the Alan Turing Institute, the UK national institute for data science, with regard to the theme of "Data Science Challenges in High Throughput Biology and Precision Medicine". This workshop was held in Edinburgh, in the Informatics Forum, and hosted by Guido Sanguinetti.

The Alan Turing Institute is a new national institute, funded partly by the government, and partly by five universities (Edinburgh, UCL, Oxford, Cambridge, Warwick). The amount of funding is relatively small compared with that of other institutes (e.g. the Crick) and seems to be enough to fund a new building next door to the Crick in London, together with a cohort of research fellows and PhD students to be based in the new building. What should be the scope of the research areas that it addresses and how should it work as an institute? There are currently various scoping workshops taking place to discuss these questions.

Data science is clearly important to society, whether it's used in the analysis of genomes, intelligence gathering for the security services, data analytics for a supermarket chain, or financial predictions for the City. Statistics, machine learning, mathematical modelling, databases, compression, data ethics, data sharing and standards, and novel algorithms are all part of data science. The ATI is already partnered with Lloyds, GCHQ and Intel. Anecdotal reports from the workshop attendees suggest that data science PhD students are snapped up by industry, ranging from Tesco to JP Morgan, and that some companies would like to recruit hundreds of data scientists if only they were available.

The feeling at the workshop seemed to be a concern that the ATI will aim to highlight the UK's research in the theory of machine learning and computational statistics, but risks missing out on the applications. The researchers who work on new and cutting edge machine learning and computational statistics don't tend to be the same people as the bioinformaticians. The people who go to NIPS don't go to ISMB/ECCB. And KDD/ICML/ECML/PKDD is another set of people again. These groups used to be closer, and used to overlap more, but now they rarely attend each other's conferences. Our workshop discussed the division between the theoreticians, who create the new methods but prefer their data to be abstracted from the problem at hand, and the applied bioinformaticians, who have to deal with complex and noisy data, and often apply tried and tested data science instead of the latest theoretical ideas. To publish work in bioinformatics generally requires us to release code and data, and to have shown results on a real biological problem. To publish in theoretical machine learning or computational statistics, there is no particular requirement for an implementation of the idea, or to demonstrate its effectiveness on a real problem. There is also a contrast between the average size of research groups in the two areas. Larger groups are needed to produce the data (people in the lab to run the experiments, bioinformaticians to manage and analyse the data, and these groups are often part of larger consortia), whereas theoretical research is often cottage-industry style, with just a PI and a PhD student. How should these styles of working come together?

Health informatics people worry about access to data: how to share it, get it, and ensure trust and privacy. Pharmaceutical companies worry about dealing with data complexity, such as how to analyse phenotype from cell images in high throughput screening, having interpretable models rather than non-linear neural networks, and how to keep up with all the new sources of information, such as function annotations via ENCODE. GSK now has a Chief Data Officer. Everyone is concerned about how to accumulate data from new bio-technologies (microarrays then RNA-seq, fluorescence then imaging, new techniques for measuring biomarkers of a population under longitudinal study). Trying to keep up with the changes can lead to bad experiment design, and bad choices for data management.

There was much discussion about needing to make more biomedical data open-access (with consent), including genomic, phenotypic and medical data. There seemed to be some puzzlement about why people are happy to entrust banks with their financial data, and supermarkets with purchase data, but not researchers with biomedical data. (I don't share their puzzlement: your genetic data is not your choice, it's what you're born with, and it belongs to your family as much as it belongs to you, so the implications of sharing it are much wider.)

All these issues surrounding the advancement of Data Science are far more complex and varied than the creation of novel and better algorithms. How much will the ATI be able to tackle in the next five years? It's certainly a challenge.