Friday, 4 August 2017

Joy plots

Today is the birthday of John Venn (of Venn diagrams fame). He was born on this day, 4th August 1834. So I was thinking about diagrams today and finally got around to looking up code for the trendiest-plots-of-2017, 'Joy plots'. As of July 2017 we now have ggjoy plotting code in R ( and they're also beautiful in Python's Seaborn ( 

They're named after the Joy Division album cover. But finding out what 'Joy Division' actually means ( is always going to make me less joyful about the joy in using them.

Thursday, 25 May 2017

Releasing research software

How should researchers be releasing their software?

The BBSRC together with the Software Sustainability Institute and Elixir-UK recently held a workshop to find out. This was the "Developing Software Licensing Guidance for BBSRC Workshop" (April 2017). The BBSRC wanted to ask the community what guidance it should be providing for grant applicants, grant reviewers and the developers and users of research software.

Here's some of the points that were raised during the day: 


Who owns the software? University academics often have contracts that say that everything we do belongs to the uni, even if it's in the evenings or weekends. Students, on the other hand tend to own their own IP. On collaborative projects, especially those collaborating with businesses, the ownership of software becomes more complex again. If the software is truly open source, then hopefully multiple people will contribute ('random strangers on the internet'). Who owns it then?


Three main options: very permissive, copyleft or commercial. If you need to choose a software license, some universities are well supported by technology transfer staff who understand all the implications and some are not. There are websites to help people choose, but the issues are complex and so far they haven't helped me choose. The consensus for permissive licenses seemed to be that while MIT was good (simple, easy to read and understand), the Apache 2.0 license actually deals with accepting contributions from other developers in future ("Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions."). If you don't use this license then you may have to use a contributer licensing agreement and that can scare off the casual patch submitter. The "for academic use only" kind of license was widely seen to be awful, restricting future collaboration, restricting project expansion and being unclear about who is and isn't an academic.

Software is produced at different scales

There are quick scripts, solid code, and whole platforms. The BBSRC will be funding work that produces all three types of software. It wants all grant-holders to think about how they will release their software and support the creation of software. We know that documenting, testing and packaging is time consuming. How do we ensure that projects allow time for this? How can software be cited properly (see Software citation principles)? Is there any such thing as a 'throwaway script'? Will the project management team include someone who knows about software?

Existing advice

There is a collection of existing advice from various places about how to choose licenses and release software.
The release of software as an output of research is clearly an issue that is being raised by multiple organisations now and it's great to see software being taken seriously. 2017 will see the second Research Software Engineers Conference (RSE2017) and a June 2016 Dahgstul Workshop produced an Engineering Academic Software Manifesto, containing pledges such as
  • I will make explicit how to cite my software.
  • I will cite the software I used to produce my research results.
  • When reviewing, I will encourage others to cite the software they have used.
and more. The BBSRC now has the task of pulling together all the discussion from the workshop and other places and creating a guidance document to help grant applicants, reviewers and panellists, and also the software developers. This will then become part of the grant proposal process along with other docs such as the data management plan, pathways to impact, justification of resources, etc.

Thursday, 27 April 2017

Ada Lovelace and her contribution to computer science

Ada Lovelace (1815-1852) is known as the first computer programmer. What did she really contribute to computing? So many blog posts, articles and books get bogged down describing her childhood, and her famous father and overbearing mother. But what did she actually do?

A while ago I got stuck in and actually read the paper she actually wrote.  "Sketch of the Analytical Engine invented by Charles Babbage... with notes by the translator. Translated by Ada Lovelace"

In fact, she translated into English a paper by an Italian engineer, Luigi Menabrea, and clearly felt that his description did not go far enough, because she added her own notes at the end, which amount to more detail than the original translation.

The original paper by Menabrea was supposed to sell the idea of the Analytical Engine, a machine designed by Charles Babbage but not yet constructed. Money needed to be raised for this endeavour. Menabrea explains what the machine can do. To some extent he does try also to sell the idea, but his writing can be quite dry: 'To give an idea of this rapidity, we need only mention that Mr. Babbage believes he can, by his engine, form the product of two numbers, each containing twenty figures, in three minutes.'

Lovelace wanted to add more, and her writing is glowing with excitement about the machine. She added 7 detailed discussion notes, labelled A to G, and also 20 numbered footnotes. In the numbered footnotes she comments on issues with the translation. She also comments sometimes on how she feels that Menabrea hasn't quite understood completely the novelty and potential (for example 'M. Menabrea's opportunities were by no means such as could be adequate to afford him information on a point like this'). In the notes labelled A to G she explains many of the concepts that are the fundamentals of computing today.

Note A

This note spells out that the Analytical Engine is a general purpose computer. It doesn't just calculate fixed results, it analyses. It can be programmed. It's flexible. It links 'the operations of matter and the abstract mental processes of the most abstract branch of mathematical science'. This is not just a Difference Engine, but instead offers far more. She is very clear that 'the Analytical Engine does not occupy common ground with mere "calculating machines". It holds a position wholly its own'. It's also not just for calculations on numbers, but could operate on other values too. It might for instance compose elaborate pieces of music. 'The Analytical Engine is an embodying of the science of operations'. Lovelace has a beautiful turn of phrase that evokes her delight in imagining where this could go: it 'weaves algebraic patterns just as the Jacquard loom weaves flowers and leaves'.

Note B

This note is about memory, the 'storehouse' of the Engine. It is about variables, and retaining values in memory. She describes how the machine can retain separately and permanently any intermediate results, and then supply these intermediate results as inputs to further calculations. Any particular function that is to be calculated is described by its variables and its operators, and this gives the machine its generality.

Note C

This note is about reuse: the Engine uses input cards, as used in the Jacquard looms. Lovelace had toured the factories of the north with her mother and seen the Jacquard looms in action. They would have been the cutting edge of technology at the time. Intricate patterns were woven by machines, with thousands of fine threads, all controlled by punched cards. The cards needed to be created by card punching machines, with detailed tables of numbers that would be translated into the patterns of holes. The cards allowed a design to be repeated, and they were bound together in a sequence. She explains that these input cards, can be reused 'any number of times successively in the solution of one problem'. And furthermore, that this applies to just one card, or a whole set of cards. The train of cards can be rewound until they are in the correct position to be used again.
Jacquard loom, Nottingham Industrial Museum

Punched cards used in Jacquard loom, Nottingham Industrial Museum

Note D

This note is about the order and efficiency of operations. She explains that there are input variables, intermediate variables, and final results. Every operation must have two variables 'brought into action' as inputs and one variable to use as the result. We can then trace back and inspect a calculation to see how a result was obtained. We can see how often a value was used. We can think about how to batch up operations to make them more efficient. She begins to imagine not just how the machine will operate, but how programmers will have to think carefully about their programs, how they will debug them and how they will optimise the algorithms they implement.

Note E

In this note Lovelace explains how loops or 'cycles' can be used to solve series, for example to sum trigonometrical series. With a worked example (used in astronomical calculations), she abstracts out the operations needed to produce terms in the series and works on a notation for expressing cycles that include cycles. Note 18 expands further to explain that one loop can follow another, and there may be infinitely many loops.

Note F

This note begins with the statement 'There is in existence a beautiful woven portrait of Jacquard, in the fabrication of which 24,000 cards were required.' Babbage was so taken with this woven portrait that he bought one of the few copies produced. The amount of work that machines could do, unfailingly, without tiring, was changing the face of industry. The industrial revolution was sweeping through the country. Lovelace explains how the Engine will be able to solve long and intensive with a minimum of cards (they can be rewound, and used in cycles). The machine can calculate a long series of results without making mistakes, solving problems 'which human brains find it difficult or impossible to work out unerringly'. It might even be set to work to solve as yet unsolved and arbitrary problems. 'We might even invent laws for series or formulae in an arbitrary manner, and set the engine to work upon them and thus deduce numerical results which we might not otherwise have thought of obtaining; but this would hardly perhaps in any instance be productive of any great practical utility, or calculated to rank higher than as a philosophical amusement.'

Note G 

She concludes with the scope and limitations of the Analytical Engine. She is keen to stress the machine's limitations and about using the machine for discovery: it doesn't create anything. It has 'no pretensions to originate anything'. Lovelace then enumerates what it can do, but warns against overhyping it. Alan Turing references her concerns in his 1950 paper Computing Machinery and Intelligence (which is also well worth a read). In this paper he discusses 9 objections that people may raise against the possibility of AI. One of these 9 objections is Lady Lovelace's Objection. He writes:
Our most detailed information of Babbage's Analytical Engine comes from a memoir by Lady Lovelace (1842). In it she states, "The Analytical Engine has no pretensions to originate anything. It can do whatever we know how to order it to perform" (her italics).
In Note G we also get the detailed trace of a substantial program to calculate the Bernoulli numbers, showing the order of operations combining variables to make results. The trace shows looping and storage, showing intermediate results, and the correspondence with the mathematical formulae being calculated at each step. She inspects the program for efficiency, working out the number of punched cards required, the number of variables needed and the numbers of execution steps.

Lovelace wonders, in the third paragraph of this Note, whether our minds are able to follow the analysis of the execution of the machine. Will we really be able to understand computers and their abilities? 'No reply, entirely satisfactory to all minds, can be given to this query, excepting the actual existence of the engine, and actual experience of its practical results.'

Ada Lovelace is buried in Hucknall Church, Nottingham. She died aged 36, and never got to see the Analytical Engine constructed.

Monday, 10 April 2017

Visiting JG Mainz University

I've just returned from a two week visit to Johannes Gutenberg University Mainz, where I was hosted by Andreas Karwath in the Data Mining group of the Dept of Informatik. I was there to pick up new research ideas, to get away from admin/teaching duties for a while and to share what we've been working on lately. 

The University at Mainz is a large campus based university on the outskirts of the town, on the beautiful river Rhine. It's named after the inventor of the printing press in the west, Johannes Gutenberg. The Gutenberg museum is excellent, tracing books from wax tablets through handwritten parchments, to moveable type print, then the industrial revolution, typewriters and high volume printing. This is a museum about the value of information and its transmission, and the technology invented to do this. There was even a special exhibition about the Futura font, as an added bonus. This is the geometric circles-and-lines font used in posters for "2001 A Space Odyssey", and on the Apollo 11 moon plaque ("We came in peace for all mankind"), and in so much future-looking advertising and propaganda in the Art Deco inspired 1930s era, both in Germany and beyond.

It was interesting to be embedded in a data mining group, rather than bioinformatics for a change. They have a broad range of application areas, but also happily switch technology (neural nets, relational learning, topic models, matrix decomposition, graphs, rs-trees and more) as the application area needs. Also very interesting to see in which ways a different country's research culture is different. It's not REF-dominated like the UK, so more they're more free to focus on quick-turnaround peer-reviewed compsci conference publications, and perhaps more hierarchically structured, as only the few professors have permanent positions. And yet it's still the same. University departments are international places and share much in common, whichever country you're in: same grant applications, student supervisions, seminar talks, dept silos, etc.

It was great fun to be there, and they were excellent hosts. At the same time, it was strange and sad to be a British person on exchange in Germany during the week that the UK sent article 50 to the EU. International collaborations are so important to research that leaving the EU is bound to be hugely detrimental to us UK academics. We need more exchange, not less.

Tuesday, 4 April 2017

The metahaplome

A sample of water, soil or gut contents will contain a whole community of microbes, cooperating and competing. We now have the sequencing technology to begin to explore these communities, to find out the variation that they possess. This can be useful in the search for new anti-microbials, or in the search for better enzymes for biofuels. However, the sequencing technology is not quite there yet. The very short length of the reads, together with the errors introduced, combine to make the problem of reassembling the underlying genomes much more complex.

We introduce the concept of the 'metahaplome': the exact sequence of DNA bases (or "haplotype") that constitutes the genes and genomes of every individual present. We also present a data structure and algorithm that will recover the haplotypes in the metahaplome, and rank them according to likelihood.

Our preprint about the metahaplome is now available at bioRxiv: Probabilistic Recovery Of Cryptic Haplotypes From Metagenomic Data.

Monday, 20 March 2017

Is academic impact useful as a proxy measure for real world impact?

In academia we have various measures of "success". Some of the more common measures are "how many papers did we publish in reputable journals" and "how many other academics went on to use my work or refer to my work". We count citations, check out our h-index, and perhaps even check the number of other academics talking about our work on social media. We'd all like our work to be useful to others. Of course, any measure of success can be gamed. Academics may cite themselves to boost citation counts, use clickbait paper titles to attract attention, and select journals for prestige rather than availability to the community.

However, to be useful to other academics is not the same as to be useful to the rest of the world. Are we also having impact outside of academia? Some blue sky basic research is unlikely to do this (but perhaps likely to be cited by academics). Some applied research can have immediate impact on medical outcomes, law and policy, societal attitudes, civil rights, environmental strategies and business practices.

The UK tries to measure this kind of research impact by asking universities to submit REF Impact case studies. These are summary documents, written by academics, describing what impact they've had. They're not easy to write, or to evidence, and yet they are used as part of the REF exercise measuring research quality, whereby Higher Education funding bodies distribute research money to the universities.

How can we make it easier for academics to find and present their impact? Before we answer that, is the whole exercise worth doing anyway? We need to know if we could just cheaply count citations and use this as a proxy for real world impact. After all, if a paper is popular with academics, surely it's also going to be useful to the rest of the world too?

We've done the analysis in Measuring scientific impact beyond academia: An assessment of existing impact metrics and proposed improvement. TL;DR: the answer is 'no'. Academic impact is not correlated with more comprehensive impact in the real world. We're not surprised, but we needed to prove it. But don't take our word for it: we've made all the data available. Try it yourself.

On with the next steps now... James will now be data mining the real world impact of scientific research, from the collection of unstructured documents out there in news archives, parliamentary proceedings, etc. We expect NLP, data mining and machine learning challenges ahead as we trace the movement of academic ideas and results out from academia and into the rest of the world.

Wednesday, 3 August 2016

Holding an internal research workshop

We have just held the 4th Aberystwyth Bioinformatics Workshop. It's a one-day workshop, held with no budget, and intended to be a mostly internal informal research networking event.

We call for 5 min lightning talks, 20 minute longer talks, demos of software, and posters. We end up with a good mixture of both. We especially encourage new PhD students to present, and for all attendees to be friendly and supportive rather than combative in their questions. Registration is done by a very simple Google form (name, email, what kind of talk, title of talk/poster, any other comments). Registration closes one week before the workshop. Tea and coffee is acquired somehow, a room is booked, talks are arranged into a programme, and then away we go.
Aber Bioinformatics Workshop attendees July 28th 2016. Photo by Sandy Spence.

 Each time we've done this we have ended up with a full day of talks. People use it to let others know what they're working on, to practise a talk they're preparing for an external conference, to ask for advice on their work, to describe the state of the compute cluster facilities and to just introduce new people. Bioinformatics at Aberystwyth is mostly done within the biology departments of IBERS, but this meeting allows Computer Science and Maths people to join in, and make interdisciplinary links. Finally we go down to the pub, and continue the discussions there.

It's a very low cost minimal preparation way to bring together a group of otherwise independent researchers. Many bioinformaticians feel that they are either the only one in their group, or else, that they're not really a bioinformatician at all and somehow masquerading as one. I've learned a great deal from each workshop that we've had, and its just great to find that we do have a surprisingly strong local support network in such a specialist field.