Progress in Biomedical Data Sharing
Headlines from the Recent NIH Workshop
The biomedical sciences have been a key focus area for efforts to promote research data sharing. Effective data management and sharing policies have the potential to improve research efficiency and accuracy, with real implications for human health. Last week, I attended a workshop hosted by the National Institutes of Health (NIH) on “Establishing a FAIR Biomedical Data Ecosystem: The Role of Generalist and Institutional Repositories to Enhance Data Discovery and Reuse.”
NIH has recently been making significant strides in encouraging its funded researchers to share their data. With the publication of a draft data sharing policy in November 2019, the NIH is moving into closer alignment with the Gates Foundation and the Wellcome Trust, two major nonprofit biomedical research funders, both of which have had data sharing requirements in place for some time. (NIH grant programs currently operate under a patchwork of different data sharing policies, helpfully summarized here.) Earlier last year, NIH announced a one-year pilot partnership with figshare to offer a repository for NIH-funded datasets. Last week’s workshop represented, in part, an opportunity for the NIH to engage with stakeholders in biomedical data curation and sharing to assess the potential of this approach, and how it is likely to interact with the myriad other data sharing efforts already underway.
Below, I share just a few of my reflections on the workshop. Videocasts of the full conference presentations, as well as the rich and thought-provoking notes collected during seven breakout sessions, are available online.
Architecting Culture Change
A recurring theme throughout the workshop was the importance–and challenges–of creating “culture change” around data sharing. Speakers urged “putting the researcher at the center” and shared real-world experiences of struggling to change users’ workflow habits following the creation of a new tool. One memorable slide proclaimed “If you build it they will come”–with “they will come” crossed out. Nici Pfeiffer’s illustration of culture change as a pyramid–resting on an infrastructural foundation and built up through incentivization and community building, with regulation the final step–powerfully summed up these discussions.
This. The reality of culture change. We have a lot of work to do for data and software sharing to be the new normative behavior for all… Says Nici Pfeiffer #NIHData @BrianNosek @OSFramework @nicipfeif
— Shelley Stall (@ShelleyStall) February 12, 2020
The workshop’s emphasis on the continuing importance of human behavior in advancing research data sharing struck me as noteworthy in contrast to other FAIR data events I’ve attended. At the CODATA workshop I attended last April, for instance, ambitious infrastructure projects like the Personal Health Train took center stage. While neither approach is necessarily better, the organizers of the NIH workshop certainly succeeded in bringing together speakers and designing breakout sessions that oriented discussion toward highly practical responses to the current state of data sharing. Additionally, it is probably fair to say that biomedical research, generally speaking, is more advanced in data sharing than most other STEM research areas. This means that many stakeholders have practical experiences of building sustainable data communities to draw on.
“Coopetition”: Generalist, Institutional, and Community Models
My colleague Roger Schonfeld recently wrote about “two competing visions” for data sharing, but the buzzword at this workshop–a portmanteau borrowed, for better or worse, from Silicon Valley–was “coopetition.” This is the idea that proponents of different approaches and initiatives can compete while working toward a common goal, accepting that some efforts will succeed while less viable alternatives fall away.
The workshop’s framing focused on the relationship between two major models of research data sharing: generalist repositories (such as figshare, Dryad, Zenodo, Mendeley Data, and Dataverse) and institutional repositories, which are hosted by academic libraries to preserve research outputs produced at specific universities. A third model–one that my Ithaka S+R colleagues and I have written about extensively–is the data community, a network of researchers who share and reuse a particular type of data, usually facilitated through a domain repository.
Workshop participants discussed all three of these models and made important steps toward envisioning productive relationships between them. For instance, in her talk on the work of the Data Curation Network, Lisa Johnston articulated a vision of “future proof” data sharing, drawing on the strengths of both the data communities model and institution-based curation. While data communities excel at facilitating discovery, they often struggle to achieve long-term sustainability. Lisa argued that institutional repositories can keep data safe while allowing data communities to “live and die,” and pointed to OAI-PMH as a solution that allows domain repositories to “mirror” the contents of institutional repositories.
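For readers unfamiliar with it, OAI-PMH is a simple HTTP-plus-XML harvesting protocol: a domain repository periodically issues `ListRecords` requests against an institutional repository's endpoint and ingests the Dublin Core metadata it gets back. The sketch below illustrates the general shape of such a harvest; the endpoint URL, set name, and sample response are all hypothetical, standing in for a live repository.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical institutional-repository OAI-PMH endpoint (illustrative only).
BASE_URL = "https://repository.example.edu/oai"

# Dublin Core namespace used in oai_dc metadata records.
DC = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL; a mirroring domain repository
    would poll this periodically to stay in sync."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # harvest only a thematic subset
    return f"{base_url}?{urlencode(params)}"

def parse_titles(oai_xml):
    """Extract Dublin Core titles from a ListRecords response."""
    root = ET.fromstring(oai_xml)
    return [t.text for t in root.iter(f"{DC}title")]

# A trimmed sample response, standing in for a live HTTP reply.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>fMRI scans, study 42</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url(BASE_URL, set_spec="neuroimaging"))
print(parse_titles(SAMPLE))
```

Because the protocol is deliberately minimal–plain HTTP requests returning XML–it asks very little of either side, which is presumably why it remains attractive as mirroring glue between domain and institutional repositories.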
New Roles for the Private Sector
In his remarks, Sayeed Choudhury described the development of research data sharing as an opportunity for collaboration between the “triad” of academia, government, and the private sector. The workshop showcased these possibilities, featuring talks not only from a variety of academic and nonprofit data sharing initiatives, but also from Anita de Waard of Elsevier, Varsha Khodiyar of Springer Nature, and Mark Hahnel of figshare.
I was particularly interested to learn about a new entrant to the research data management scene, Flywheel. Flywheel is a software platform supporting full lifecycle research data management for life sciences and clinical research, currently raising Series B funding. Travis Richardson, CEO, spoke about the product, focusing on how it builds reproducibility and data curation into research processes to ease the burden of data and software sharing. Flywheel focused on selling to academic institutions when it launched in 2016 and currently counts thirty universities among its customers; since then, it has also begun selling to pharmaceutical companies, which now (perhaps unsurprisingly) constitute its most significant revenue stream. (Side note: I attended a meeting of the Pharma Documentation Ring on FAIR data sharing last year, and learned that the pharmaceutical industry is increasingly enthusiastic about sharing the research data it generates.)
Perhaps Flywheel’s most exciting feature is a system which facilitates reproducibility, software sharing, and version control by preserving code in manageable chunks. These chunks, or “gears,” are packaged using Docker container technology and can be shared through a “gear exchange.” Currently a cloud-based offering with R, Python, and MATLAB built in, Flywheel is committed to a vision I’ve heard articulated by many FAIR data proponents: instead of downloading shared datasets to work with on local computers, researchers will one day “visit” datasets on the cloud and analyze them in situ.
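I haven’t seen a gear’s internals, but the general pattern of packaging an analysis step as a Docker image is familiar from reproducibility work elsewhere. A minimal sketch might look like the following; the file paths, script name, and dependency versions are all hypothetical, not Flywheel’s actual conventions.

```dockerfile
# Hypothetical packaging of a single analysis step as a container image,
# in the spirit of a reproducible "gear" (names and paths are illustrative).
FROM python:3.11-slim

# Pin dependencies so the container produces the same results years later.
RUN pip install --no-cache-dir numpy==1.26.4 pandas==2.2.2

# Copy in the analysis script; data is mounted at runtime, not baked in.
COPY analyze.py /opt/gear/analyze.py

# The platform mounts input data read-only and collects whatever the
# script writes to the output directory.
ENTRYPOINT ["python", "/opt/gear/analyze.py", "--input", "/input", "--output", "/output"]
```

Because the image pins its language runtime and library versions, anyone pulling it from an exchange should get bit-for-bit the same environment–which is exactly what makes the “visit the data in situ” vision plausible: the code travels to the data rather than the reverse.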
Conclusion
It will be interesting to see how biomedical data sharing policies and infrastructures in the United States continue to develop in the coming years. The NIH and the National Library of Medicine (NLM), one of its constituent institutes, appear to be taking an increasingly active role in leading and convening stakeholders who have been working in this space for some time. NLM is developing a data science skills training curriculum and certification for academic librarians, and this week, NIH’s Office of Data Science Strategy will host a virtual workshop on data metrics, focused on “assessing dataset and data resource value and reach.” The high level of cooperative, practically focused engagement we are seeing from government, nonprofit, and private spheres is a promising sign for these future efforts.
A great deal of work remains to be done to achieve sustainable, effective research data sharing across biomedical and other research fields. This spring, Ithaka S+R will partner with a cohort of 20 U.S. academic libraries to investigate the evolving research activities and support needs of scholars who work with big data across a variety of fields. We are also working to create a method for systematically identifying emergent data communities and their concrete needs, with a goal of eventually piloting this method with on-the-ground providers of research support services. For more information about these and other projects, please email Danielle Cooper, Ithaka S+R’s manager of collaborations and research, at Danielle.Cooper@ithaka.org.