
Let’s be open

Scientists, governments and funding agencies have for some time been pushing for data generated by publicly funded scientific research to be made freely available. But just how could “open data” work in practice? Jon Cartwright investigates

Any great research finding in physics is sure to get people talking. Such was the case last January when physicists operating the CoGeNT detector at the Soudan Underground Laboratory in Minnesota, US, reported an intriguing rise in electronic blips at energies of less than one kilo-electronvolt (keV). This is the sort of energy at which a particle of dark matter – that elusive, invisible substance thought to make up more than four-fifths of the universe’s matter – is expected to recoil off an atomic nucleus.

One person intrigued by the CoGeNT findings was Celine Boehm, a particle theorist at Durham University in the UK. She is not part of the collaboration and in normal circumstances would have had to take CoGeNT’s findings on trust. Fortunately for Boehm, the CoGeNT group had made public its full dataset, releasing in an easily digestible format all the electronic readings taken from a 0.5 kg germanium crystal over more than three years. Checking over the CoGeNT group’s analysis – particularly its estimate of the fraction of signals that should be interpreted as background – Boehm and her colleagues concluded that there was no significant rise in signal below 1 keV after all.
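
To get a feel for what such a reanalysis turns on, the toy Python sketch below asks the bare-bones version of the question: once the background in a sub-keV energy bin has been estimated, how significant is the remaining excess? The counts are invented for illustration and have nothing to do with the real CoGeNT dataset.

import math

# Toy numbers, NOT CoGeNT data: total counts observed in a sub-keV
# energy bin, and the counts the background model predicts there.
observed = 1250
background = 1180.0

# Crude Poisson estimate: an excess is only interesting if it stands
# well above the sqrt(N) statistical fluctuation of the background.
excess = observed - background
significance = excess / math.sqrt(background)
print(f"excess = {excess:.0f} counts, roughly {significance:.1f} sigma")

With these made-up numbers the “bump” of 70 counts comes out at about two sigma – exactly the sort of marginal excess that can evaporate once the background fraction is re-estimated.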

The finding might have been a blow to those seeking evidence of dark matter, but it was a triumph for open data – the practice of making scientific data public and free to use. “Evidence for this sort of particle would be extremely interesting,” Boehm explains. “That is why we took the CoGeNT results so seriously. I’m so grateful that they actually provided the data – at least we could understand what was going on in the analysis, the assumptions they were making and whether they were compatible with the theory.”

Open data can be a worthwhile, even moral philosophy, but one that does not always accommodate practical limitations

Boehm’s reanalysis highlights one of the main benefits of open data: the ability to reuse the product of other people’s experiments by applying it to a new problem, or – as in the case of Boehm – contradicting an original analysis. To its proponents, the idea of open data seems so at one with the scientific method that there could be little reason to argue against it. Sceptics, on the other hand, see as many problems created as solved. At one end of the scale are fears of data misuse, breaches of privacy or being beaten to a first analysis. At the other end are those immortal grievances: too little time, not enough money.

“Open data is great, as long as you’re not the one who has to acquire, manage and secure it,” says Roy Wogelius, a geochemist at the University of Manchester in the UK who has used the huge amounts of data generated by imaging experiments at synchrotron-radiation sources. For Wogelius and others, open data can be a worthwhile, and in some cases even moral, philosophy; but it is also one – particularly when data are more abundant than ever – that does not always accommodate practical limitations.

Out in the open

Whatever the worries, open data is catching on fast. In recent years many of the major funding bodies, including the UK’s Engineering and Physical Sciences Research Council (EPSRC), the European Commission and the US’s National Science Foundation, have announced open-data policies that generally affirm a commitment to making as much publicly funded research data as possible freely available. EPSRC’s first principle on data sharing states that any data generated by research that it funds “is a public good produced in the public interest and should be made freely and openly available with as few restrictions as possible in a timely and responsible manner”.

Open-data policies have trickled down to individual scientific institutions, too. In November 2014 the European particle-physics lab CERN launched its Open Data Portal, through which anyone can access petabytes of historical particle-collision data. In some areas of physics, such as astronomy, the practice is already near-standard, with large-scale surveys such as the international Dark Energy Survey (DES) releasing all their data after a certain “proprietary” period, during which those who took the data have first crack at looking for new findings. “We expect and hope that [other] scientists will use and make discoveries with the data in ways that we haven’t even thought of,” says DES director Joshua Frieman.
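
For a sense of what “anyone can access” means in practice, here is a minimal Python sketch that streams a single file from an open-data portal to disk. The URL is a made-up placeholder in the style of CERN’s portal (opendata.cern.ch), not a real record; any publicly hosted dataset file would be fetched the same way.

import requests  # third-party HTTP library: pip install requests

# Placeholder URL only – substitute the file link from a real record
# on an open-data portal such as opendata.cern.ch
URL = "http://opendata.cern.ch/record/XXXX/files/example.root"

def fetch(url, dest):
    """Stream a potentially large data file to disk in 1 MB chunks."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)

fetch(URL, "example.root")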

At the individual level, open data are also becoming more popular. In March 2014 the books and journal publisher Wiley surveyed more than 2000 scientists worldwide and found that on average 52% make their data public. (In the physical sciences the figure is a little lower, at 45%.) Yet the same survey revealed a variety of reasons for not sharing data. Chief among these were concerns over intellectual property or confidentiality (the latter more pertinent to the life, health and social sciences). However, many reasons that were practical rather than principled were also given: “I did not know how to share my data”; “I did not know where to share my data”; “insufficient time and/or resources”, to name but three.

Information overload A team at the European Synchrotron Radiation Facility in France was able to use X-rays to identify what food ammonites had consumed. But studies like this generate many terabytes of data. (ESRF)

Rudolf Dimper, head of the technical infrastructure at the European Synchrotron Radiation Facility (ESRF) in Grenoble, France, agrees that these are very real problems for some scientists. Unlike CERN – a lab that has become a second home to many scientists working within one discipline (high-energy physics) – ESRF is an analytical facility shared by scientists from different fields and backgrounds. Researchers performing several days of X-ray imaging at ESRF, says Dimper, could easily find themselves with 50 TB of raw data. This would be a relatively small amount for a specialist lab such as CERN to curate and share, but for a small visiting research group – particularly one that is not well funded – it is nigh-on impossible. In fact, a key role for Dimper and his colleagues is to help such groups reduce their initial dataset so that it is small enough to physically take away with them on a hard drive.
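
What “reducing” a dataset means is easiest to see in miniature. The Python sketch below averages each detector frame over 4 × 4 pixel blocks, cutting the volume roughly eightfold; the frame sizes and binning factor are invented for the example and do not describe ESRF’s actual pipeline.

import numpy as np

def bin_frames(frames, factor=4):
    """Average a stack of frames over factor x factor pixel blocks.

    frames has shape (n_frames, ny, nx), with ny and nx divisible by
    factor; the result has shape (n_frames, ny//factor, nx//factor).
    """
    n, ny, nx = frames.shape
    blocks = frames.reshape(n, ny // factor, factor, nx // factor, factor)
    return blocks.mean(axis=(2, 4), dtype=np.float32)

# Hypothetical stack: ten 2048 x 2048 16-bit frames (about 84 MB)...
raw = np.zeros((10, 2048, 2048), dtype=np.uint16)
reduced = bin_frames(raw)  # ...becomes (10, 512, 512) float32, about 10 MB

Applied across tens of terabytes of raw images, reductions of this kind are what make it feasible to leave the facility with a single hard drive.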

But that is easier said than done. Even with schemes in place to reduce the volume of data, Dimper is concerned that, with more data being generated all the time, some groups will struggle to take their measurements home. “We have detectors that produce so much data that many scientists will have a real problem knowing what to do with it,” he says. “If a scientist comes to our facility and is not able to take away the data, what do we do? Should we not do these experiments? Do we keep data only until they are analysed?” In any case, he adds, no-one strictly owns the data generated at ESRF, so it is not clear with whom the responsibility lies when it comes to making data openly available – at present the facility itself does not have the storage capacity.

Data divides

Potential problems with open data do not end once data are finally “out there”. In the Wiley survey, more than a quarter of respondents voiced concern that their data, once freely available, would be subject to “misinterpretation or misuse”. Deliberate misinterpretation of data is a bane of many politically divisive research areas, such as climate science (although in this case, it is not climate scientists but meteorological offices that own observational data). In less political areas, however, many would say that disagreements over interpretation are a good thing and a sign of healthy science.

But a lack of access to raw data can itself become a point of contention. Indeed, a long delay in the release of raw data from scanning-tunnelling-microscopy experiments helped fuel a heated debate surrounding the existence of “stripy nanoparticles” – nanoparticles that supposedly have molecules arranged in stripe-like structures on their surface (see June 2014, News & Analysis). This non-availability was one reason why Raphaël Lévy, the physicist at the University of Liverpool in the UK who led the sceptics’ charge against stripy nanoparticles, has become “totally committed” to open data. “Although there are practical challenges in some cases, those are not insurmountable and I can see only benefits,” he says.

Of course, it is not always necessary to have open access to someone else’s raw data to conduct a new analysis. One famous example took place in June last year, when cosmologists Uroš Seljak and Michael Mortonson at the University of California, Berkeley, in the US independently dismissed widely publicized evidence for cosmological inflation that had been reported three months earlier by members of the BICEP2 telescope collaboration at the South Pole. The pair did so using nothing more than the compressed results plotted in the original BICEP2 paper.

Data generator The Dark Energy Survey camera takes hundreds of high-resolution images every night, and each is 10 GB. (FNAL)

And in some cases, an original publication is not even necessary: in 2008, particle theorists Marco Cirelli at the Alternative Energies and Atomic Energy Commission in Saclay, France, and Alessandro Strumia at Pisa University in Italy managed to produce an independent analysis of results from the PAMELA dark-matter experiment using data that one of them had photographed from a slide presented during a conference talk.

Cirelli and Strumia’s “paparazzi” approach highlights another concern about open data: that by releasing your data you let someone else make a breakthrough at your expense. Scooping should not be a problem most of the time, since most scientists agree that data ought to be protected by embargo for a certain proprietary period, typically the time it takes for the original data-gatherers to produce their first paper. Without such a proprietary period, the argument goes, there would be no incentive for researchers to gather the data in the first place, with all the associated time and cost.

But not everyone believes in waiting until publication to share data, and this is often where the fear of getting beaten to a breakthrough arises. Cameron Neylon, director of advocacy at the open-access publisher Public Library of Science (PLOS), believes in publishing data as soon as they have been gathered (although this is not a PLOS requirement). He admits that the practice does leave a scientist open to scoops, but not necessarily the sort of scoops that could damage the scientist’s career. “I am not aware of any cases where publicly shared data were reused in a way that prevented someone publishing what they had intended to do,” he says. “I am more aware of cases where the sharing led to new collaborations, or offers of co-authorship.”

In fact, Neylon believes working in the open – including by keeping a publicly accessible online lab book – can prevent scoops, by asserting priority. “My group was the first group to publicly describe the use of a particular enzyme that fluorescently labels proteins in a general way,” he says, recalling an example from his time as a biophysicist. “Someone else published a paper on it first – they were working on it independently, and I don’t think they knew about what we were doing. But I can point to our publicly available online lab book and say, ‘No, we did it first.’ ”

Neylon believes there is more to open data and open lab books than altruism. “If you work openly, people can see and in some ways interact with what you’re doing – you get really good input and advice…Frankly I can be a bit lazy, a bit sloppy – but if everything’s public, my work improves.”

Falling short

In a world where the fight for the next research grant is so competitive, few people will be converted to Neylon’s extreme approach – indeed, he refers to himself as an “outlier”. Nevertheless, through PLOS he still advocates the less revolutionary approach of making data open upon publication, and he believes scientific journals have a key role to play in making this happen. Although most journals have a policy requiring authors to make available all data accompanying an article, he says that this is rarely enforced. “Most people would sign up to the idea that data should be made available,” he insists. “In practice, people stepping up to the plate and delivering is very patchy.”

Precious load Scientists can be reluctant to share their data publicly. (CC BY / PLOS Biol. 12 e1001779 / Ainsley Seago)

The situation is getting better. In April 2014 IOP Publishing, which publishes Physics World and more than 70 other journals and magazines, teamed up with the online data repository Figshare to investigate how its authors could share data more easily. The following month Nature Publishing Group launched a new journal, Scientific Data, specifically for publishing large, valuable datasets. Meanwhile, PLOS has been implementing a new data policy, which requires its authors either to upload all supporting data or (in special circumstances, for example when some of the data are commercially sensitive) to give specific instructions for how to obtain them. “The data availability statement cannot read simply, ‘E-mail the authors’,” says Neylon. “That’s really the big step we’ve taken.”

Ultimately, though, more data will become open only if scientists themselves believe that the cause is worth the effort. Neylon thinks a new culture is needed – one in which there is always a question at the back of the scientist’s mind: how will these data be shared? “We’re not really talking about a change in values – most people, if prodded, would agree that making data available is the right thing to do,” he says. “But we do need a change in culture around how we do that. And that’s a hard problem.”

  • What are your thoughts on open data? E-mail us at pwld@iop.org to let us know