Friday, 23 August 2013

Venn diagrams

I'd always thought that drawing Venn diagrams was quite trivial, until I needed to create some recently. They are trivial, if we only have 2 equal-sized sets. If we have 3 sets, and we want the circle sizes to represent the number of data items in each set, then the layout algorithm is more complex. Luckily there's a great package for Python called matplotlib-venn which does exactly what I wanted.
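As a rough sketch of the data that goes into a 3-set diagram, here is how the seven region counts could be computed with plain Python sets. The function name `venn3_region_sizes` and the example sets are made up for illustration; they are not part of matplotlib-venn.

```python
def venn3_region_sizes(a, b, c):
    """Return the sizes of the seven regions of a 3-set Venn diagram.

    Order: (A only, B only, A&B only, C only, A&C only, B&C only, A&B&C).
    """
    a, b, c = set(a), set(b), set(c)
    return (
        len(a - b - c),    # in A only
        len(b - a - c),    # in B only
        len((a & b) - c),  # in A and B, but not C
        len(c - a - b),    # in C only
        len((a & c) - b),  # in A and C, but not B
        len((b & c) - a),  # in B and C, but not A
        len(a & b & c),    # in all three
    )


# Made-up example data, purely for illustration.
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7}
set_c = {5, 7, 8}

sizes = venn3_region_sizes(set_a, set_b, set_c)
print(sizes)
```

If matplotlib-venn is installed, my understanding is that a tuple in this order can be passed straight to it, along the lines of `venn3(subsets=sizes, set_labels=('A', 'B', 'C'))`, and the package then works out the circle sizes and positions.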

However, if I have 4 sets, then it gets more complicated (and isn't yet handled by matplotlib-venn). A Venn diagram must show all possible intersections. Venn used overlapping ellipses to show how this could be achieved.

Diagram by RupertMillard from Wikimedia Commons

This is where Venn diagrams differ from Euler diagrams. Euler diagrams don't show empty intersections, so they can look much simpler than Venn diagrams, and can contain fully-nested circles. There are Venn diagrams that can represent all the overlaps of 5 and 6 sets, but we'd end up with some extremely complex diagrams that don't really aid the visualisation of our data.
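To make the difference concrete, here is a small sketch that lists every region a Venn diagram has to draw for n sets (there are 2^n - 1 of them, so 15 for 4 sets) and then picks out the non-empty ones, which are all an Euler diagram would need to show. The function name `venn_regions` and the example data are my own inventions for illustration.

```python
from itertools import combinations


def venn_regions(named_sets):
    """Map each combination of set names to the items that are in
    exactly those sets and in no others."""
    names = sorted(named_sets)
    regions = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            # Items common to every set in this combination...
            inside = set.intersection(*(named_sets[n] for n in combo))
            # ...minus anything that also appears in one of the other sets.
            outside = set().union(*(named_sets[n] for n in names if n not in combo))
            regions[combo] = inside - outside
    return regions


# Made-up example data: D does not overlap with anything.
example = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {3, 5},
    "D": {6},
}

regions = venn_regions(example)
non_empty = {combo: items for combo, items in regions.items() if items}

print(len(regions), "regions in the Venn diagram")
print(len(non_empty), "non-empty regions for an Euler diagram")
```

With this data, most of the regions are empty, which is exactly the situation where an Euler diagram ends up looking much simpler than the full Venn diagram.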

Update (9 Oct 2013): Here's a JavaScript/D3.js interactive version with more thoughts on why it's a difficult problem.

Update (20th March 2014): Here are a couple of completely over the top diagrams: a pine tree and a banana. Venn or Euler?

Monday, 12 August 2013

Reviewing for triple-blind conferences

Computer Science is different to some other fields of research in that it uses refereed conferences as the main quick-turn-around publication venue. There are many excellent computer science conferences with high quality submissions and high quality reviewing. Submitting your paper to a CS journal can take years (and give you a dialogue, where you get a chance to fix any problems), but a conference will give you a straight yes or no decision, and good feedback, within around 3 months. And then if accepted, your work gets the publicity of a presentation too.

However, for the reviewers this can mean a heavy load if the organisers don't get enough reviewers together. I have just finished reviewing my allocation of 9 papers for ICDM 2013. Some people would only review 10 papers in an entire year! There were some excellent submissions amongst my 9 and it's going to be strongly competitive this year. Nine is a rather tough load, but one of the reasons I do reviews is that it makes me read more widely in areas tangential to mine, and it keeps me up to date. It's also a payback for all the reviews that others have done for me.

But ICDM has a triple-blind submission procedure. The authors don't include their names, and in fact they have to try to remove any evidence of themselves from the paper ("We extended the work of Smith et al" rather than "We extended our previous work"). They don't get to know who wrote their reviews, and even the programme committee co-chairs don't get to know the identities of the authors or reviewers until after the decisions have all been made. This is supposed to help us be fairer, so that we won't be unduly influenced by big names or previous work. However, in practice, it's almost always possible to work it out, to recognise citations, writing style, working area, etc. People who are proud of their work will possibly try to leave you clues anyway. Very little work in a research group is done in isolation from previous work in that area. Data mining researchers don't make up such a large community that this could be hidden.

In fact I'd argue that the triple-blind process is counter-productive. When other reviewers can see my name and we have to debate any disagreements over our reviews, then I feel my reputation is at stake if I give a poor review. So I'll go to lengths to make sure everything I say is fair. If my identity is known to the authors then likewise, I'll make extra sure that I'm not just seen as complaining, but actually giving them useful advice. We encourage better review quality by making it open. And seeing as how I can guess most of the authors anyway, there's little point in hiding their names. All it does is stop them making their code and data available for inspection. Some of the papers said "A link to the code will be inserted after the review process" and some papers just had no mention of making any data/code available. They would have been discouraged from doing so by the blind review, rather than encouraged.

Instead of closing and anonymising the system (blind, double-blind, now triple-blind), I think it would be more productive to open it. Would it be any worse if we did? There are certainly ways in which it would be better.