The truth about scientific data

Daniel Reed’s job is to explore the future. As the University Chair in Computational Science and Bioinformatics at the University of Iowa, he gets plenty of opportunity to do so, especially in connection with high-performance computing, big data, and edge networks.

In the keynote address at Internet2’s Global Summit, Reed will explore balancing the needs of supporting the academic enterprise from a technology, application, and cyberinfrastructure perspective. 

Science Node recently caught up with Reed to talk about the future of infrastructure, the cost of keeping data, and what scientists and CIOs can learn from librarians.

What are some of the biggest technical issues you see facing research organizations right now?

I sometimes like to say that, in the technology business, the questions don’t change but the answers do. Right now, we're seeing an explosion of interest in internet-connected devices — the internet of things (IoT). That's true not only in the consumer space, but in the science space as well.

<strong>Scientists increasingly rely</strong> on internet-connected sensors to collect information, resulting in an explosion of data for research organizations to store and manage. Courtesy NOAA.

Historically, a lot of the challenges in networking were in the so-called 'last mile,' that notion of how to actually deliver bandwidth out at the edge. In the past that meant offices and desktops.

But now it increasingly means this world of devices at the edge and the intelligence migrating out to them.

Connected to that is the explosive growth of data. It strains operating infrastructure and the network, and it raises the questions of provenance and stability that research agencies and universities are dealing with. We have to rethink what it means to support data.

How are you going to do that?

I think we can learn an important lesson from librarians — and that’s that they throw things away. If you think about the nature of curation, it is carefully selecting which things to preserve because one cannot preserve everything.

Previously, we've had the luxury of decreasing storage prices to let us keep adding capacity. But if you look at the costs associated with saving the volume of data we're producing, it's not sustainable.

But data! How can research organizations choose what data is worth keeping?

It’s not easy. If you're an individual faculty member, often your notion of how much storage costs is defined by the last multi-terabyte desktop you bought from Amazon or Best Buy — which is to say it's really cheap. But institutional storage, where you have backups, protection, and security, is quite expensive relative to consumer technology.

<strong>Library lessons.</strong> Due to the limits of physical space, librarians must continually edit their collections. The time has now come, says Daniel Reed, when scientists too may have to choose which data is worth keeping. Courtesy New Jersey Library Association. <a href='https://creativecommons.org/licenses/by-nc-nd/2.0/'>(CC BY-NC-ND 2.0)</a>

In computing, the so-called condo model, where individual researchers buy nodes and plug them into racks an institution has bought, has been quite successful because there’s an economic value proposition for the individual researchers. But in the data context, it's almost the opposite. If you want data to be archived institutionally, it's going to cost a lot more than keeping it locally.

On the other hand, increasingly the value of data lies not just in the domain in which it was captured, but in other domains. When the reward metrics within a domain no longer justify keeping data, there may be institutional or organizational reasons to keep it for other groups because that fusion is often where new insights occur.

Ultimately, I don't know any simple way to choose what data to keep other than ongoing conversations. Those conversations need to be elevated beyond the individual researcher to interdisciplinary decision groups that can think about not just what their own group might be interested in doing, but what other groups might be interested in pursuing as well.

It sounds like this is a tough moment to be managing research infrastructure. Is there any good news?

One of the really interesting things that's happening is that we're democratizing access to science. The bread and butter of intellectual discovery is being done by individuals in small teams. This parallels what’s happening in the consumer space, not only through new kinds of devices and dramatically lower costs, but also the rise of tool kits that make it much simpler for people to do things.

<strong>Daniel A. Reed</strong> is University Chair in Computational Science and Bioinformatics, and Professor of Computer Science, Electrical and Computer Engineering, and Medicine at the University of Iowa.

For example, there's enormous excitement around Jupyter notebooks for data analytics. They let people share complex data analyses with the ease of sharing a webpage across collaborators and groups.

This means that for all the talk about the struggles around scientific reproducibility and validation, some technologies are now making it much easier to not only say, ‘Here's my paper and results,’ but ‘Here's exactly, literally, the code I used to do it. You can take your own data or my data and you can verify those calculations if you want.’

We sometimes focus too much on the big shiny objects at the high-end because that's where the money is. But there's a revolution taking place in science by empowering the long tail.

As the mathematician Alfred North Whitehead once said, “Civilization advances by extending the number of important operations which we can perform without thinking about them.” We're enabling a much larger number of things for researchers to do without thinking about how they work. 

Copyright © 2018 Science Node ™  |  Privacy Notice  |  Sitemap
