Big Science, Big Data, Big Challenges: Data from Large-Scale Physics Experiments

Saturday, 15 February 2014
Regency D (Hyatt Regency Chicago)
David Reitze, California Institute of Technology, Pasadena, CA
Large-scale physics experiments generate huge quantities of data.  As an example, the US Advanced Laser Interferometer Gravitational-wave Observatory (LIGO) will produce 1 petabyte (=10^15 bytes) of data per year when it comes online in 2015.  The Large Hadron Collider currently generates approximately 25 times that amount.
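
For scale, a minimal back-of-envelope sketch in Python, assuming only the 1 petabyte/year and 25x figures quoted above, converts these annual volumes into sustained average data rates:

```python
# Back-of-envelope data rates implied by the figures in the abstract.
SECONDS_PER_YEAR = 365.25 * 24 * 3600            # ~3.16e7 seconds

ligo_bytes_per_year = 1e15                       # 1 petabyte/year (Advanced LIGO)
lhc_bytes_per_year = 25 * ligo_bytes_per_year    # ~25x LIGO (LHC)

for name, volume in [("LIGO", ligo_bytes_per_year), ("LHC", lhc_bytes_per_year)]:
    rate_mb_s = volume / SECONDS_PER_YEAR / 1e6  # bytes/s -> MB/s
    print(f"{name}: ~{rate_mb_s:.0f} MB/s sustained")
# LIGO: ~32 MB/s sustained
# LHC:  ~792 MB/s sustained
```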

The storage, reduction, and analysis of these complex data sets have typically been carried out by large teams of ‘insiders’: expert scientists who belong to collaborations organized around the production of scientific results.  In the past few years, however, there has been a strong movement to broaden access to large-scale physics data in the US, driven both by ‘top down’ pressure (i.e., federal agency policies) and ‘bottom up’ pressure (i.e., ‘outsider’ researchers who want access to the data for their own research interests).

By and large, the movement toward broader access is a positive one, but one that requires care in implementation.  Providing open access to large, complex data sets is neither easy nor inexpensive.  It requires technical effort (in long-term curation, data reduction and associated metadata production, data delivery in commonly used data formats, software to read and visualize the data, and associated documentation) as well as an understanding of the needs of the broader research community.  There are cultural barriers to overcome (‘It’s my data, why should I have to give it to you?’) as well as implications for the intellectual property rights of the data producers.
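
To illustrate the ‘commonly used data formats’ point, here is a minimal sketch of what reading a released strain time series might look like in Python using the widely supported HDF5 format (via the h5py library); the file name and dataset layout are hypothetical stand-ins for illustration, not any collaboration's actual schema:

```python
import h5py
import numpy as np

# Hypothetical file and dataset names, chosen only to illustrate the idea
# of delivering data in a standard format with its metadata attached.
with h5py.File("strain_segment.hdf5", "r") as f:
    dset = f["strain/values"]
    strain = dset[:]                        # the time series itself
    t0 = dset.attrs["gps_start"]            # metadata travels with the data
    dt = dset.attrs["sample_dt"]
    times = t0 + dt * np.arange(len(strain))

print(f"{len(strain)} samples starting at GPS time {t0}")
```

Packaging data this way lets an ‘outsider’ researcher work with the data using off-the-shelf tools, without access to the collaboration's internal software stack.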

In this talk, I will survey current trends in open access and use LIGO as a case study to illustrate both the benefits and the challenges associated with providing large data sets to the broader research community.