- The release includes both high-level collision data and simplified datasets.
- Along with the data, analysis tools have also been made available.
- The data holds potential for both researchers and educators.
The CMS Collaboration has made 300 TB of high-quality data from the Large Hadron Collider (LHC) available to the public through the CERN Open Data Portal. The data comes from the Compact Muon Solenoid (CMS), which is one of the two large general-purpose particle detectors designed to observe phenomena produced by collisions in the LHC.
Collision data can be downloaded in two forms. The so-called ‘primary datasets’ are high-level collision data in the same format the CMS Collaboration uses for its own research. The ‘derived datasets’, on the other hand, are simplified, require far less computing power, and can be readily analyzed by university or even high-school students.
The CERN Open Data Portal was launched in November 2014. The LHC experiments ALICE, ATLAS, CMS, and LHCb have each contributed derived data samples to the portal for educational use, while CMS has also previously made public some 27 TB of high-level collision data.
“We’ve taken a big step forward with this new release,” explains Tim Smith, leader of the Collaboration, Devices and Applications group in the CERN IT Department. “In 2014 we opened up collision data that, by itself, was usable only in quite a specific way. Since this new set includes both collision data and associated simulated data you can do full analysis on it. Preparing the data, analyses, and information for release has taken experts from the CMS Collaboration, the CERN IT Department, and the CERN Scientific Information Service about six months.”
Education and inspiration
“Once we’ve exhausted our exploration of the data, we see no reason not to make them available publicly,” explains Kati Lassila-Perini, a CMS physicist who leads these data-preservation efforts. “The benefits are numerous, from inspiring high-school students to the training of the particle physicists of tomorrow. And personally, as CMS’s data-preservation coordinator, this is a crucial part of ensuring the long-term availability of our research data.”
Notably, CMS is also providing simulated data generated with the same software version that should be used to analyze the primary datasets. Simulations play a crucial role in particle physics research, and CMS is making available the protocols used to generate the simulations it provides. The data release is accompanied by analysis tools and code examples tailored to the datasets.
The released data come from collisions that took place in 2011. “Physicists are now working to develop notebook-style analyses, so that the data processing steps are clearly described and ensure that information about the data is not lost,” says Smith.
Open data leads to new collaboration
These data are being made public in accordance with CMS’s commitment to long-term data preservation and as part of the collaboration’s open-data policy.
The potential of open LHC data has already been demonstrated with the previous release of research data. A group of theorists at the Massachusetts Institute of Technology (MIT) wanted to study the substructure of jets, the showers of hadron clusters recorded in the CMS detector. Since CMS had not performed this particular research, the theorists got in touch with CMS scientists for advice on how to proceed. This blossomed into a fruitful collaboration between the theorists and CMS revolving around CMS open data.
The IT behind the portal
“The portal was built on top of the Invenio digital library framework,” says Tibor Simko, the technology leader behind the CERN Open Data Portal. “The development was done in a fully open manner and all the source code is available on GitHub.”
The CERN Open Data Portal uses EOS, the same data store that CERN researchers use for their own physics analyses. When you use the CERN virtual machine (CernVM) from the portal, the analysis software can download the data in chunks as required, so you don’t need to worry about swamping your computer.
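The idea of fetching data in chunks as required can be illustrated with a minimal Python sketch. It is not the mechanism used by CernVM or EOS. The function names `read_in_chunks` and `count_bytes`, and the in-memory stand-in for a dataset, are hypothetical and chosen only for illustration. The point is that each chunk is read only when the analysis asks for it, so the whole dataset never has to sit in memory at once.

```python
import io
from typing import BinaryIO, Iterator


def read_in_chunks(source: BinaryIO, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield successive chunks from `source`, reading each one only on demand."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            # An empty read means the end of the data has been reached.
            break
        yield chunk


def count_bytes(source: BinaryIO, chunk_size: int = 1024 * 1024) -> int:
    """Toy 'analysis': total the size of the data while holding one chunk at a time."""
    total = 0
    for chunk in read_in_chunks(source, chunk_size):
        total += len(chunk)
    return total


if __name__ == "__main__":
    # Hypothetical stand-in for a remote dataset: 2500 bytes held in memory.
    fake_dataset = io.BytesIO(b"x" * 2500)
    print(count_bytes(fake_dataset, chunk_size=1000))
```

Because `read_in_chunks` is a generator, memory use is bounded by `chunk_size` rather than by the size of the dataset, which is the property the portal's setup relies on.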