Data Repository Platforms: A Primer

Nicole Betancourt

Last updated on March 10, 2021

While scholars generally believe in the value of sharing and preserving research datasets, many do not believe it’s worth their time to do so. And when they do invest their time in data sharing and preservation, they tend to prefer doing so in an independent and self-reliant fashion. These are issues that we have not only documented through our long-standing national faculty survey but also contended with in our own work as social science researchers conducting large-scale survey studies.

Data sharing can be valuable for a variety of reasons. It permits others to replicate analyses and results, spurs additional research with pre-existing datasets, improves methods of data collection through the scrutiny of others, and broadly encourages alternative perspectives, which can promote a diversity of analyses and conclusions. Additionally, sharing research data contributes to societal knowledge and can prevent other researchers from sinking resources into duplicating data collection efforts by allowing them to work from pre-existing data. During the COVID-19 pandemic, when faculty are encountering challenges in conducting research with newly generated data, leveraging data that has already been collected and analyzed can be particularly useful. Many scholars weigh these benefits against the aforementioned challenges, along with funder mandates, when determining whether and how to deposit their data.

Since there is a robust landscape of research data sharing spaces, we decided to conduct exploratory, high-level research on a number of data repositories, primarily to inform our own data deposit protocols. We regularly deposit data from the US Faculty Survey, the Library Director Survey, and several other research projects with ICPSR. Recognizing that our research on a variety of characteristics of data repositories may be useful to other researchers, today we are publishing a summary of our findings.

Below you can find seven repositories compared side-by-side in tabular format. We have highlighted particular factors that are key for informing decision-making: disciplinary scope, typical timelines for processing datasets, associated costs, and services offered (such as data curation).

Repository name	Disciplinary scope	Offers data curation?	Length of time to curate data	Cost of data deposit	Accessing data deposits
Dryad	General repository with a focus on scientific and medical datasets	Yes	Approximately one day	There are a variety of paid membership plans available to institutions and publishers for depositing datasets. Pricing is based on factors such as the level of research grant funding.	No cost associated with accessing datasets
Figshare	General repository	No, but available for Figshare for Institutions	N/A	No cost associated with depositing datasets	No cost associated with accessing datasets
Harvard Dataverse	General repository	Yes	Free consultation and assessment takes between 1-3 hours, but actual length of time for curation varies depending upon the complexity of the data	No cost associated with depositing datasets	No cost associated with accessing datasets
ICPSR	General repository with a focus on social science datasets	Yes	Once assigned to a curator, the curation process for most studies takes anywhere from 4-8 weeks, but can take up to several months depending on the complexity of the data and the level of curation needed.	No cost associated with depositing datasets; there may be additional fees for particularly large datasets	Access to ICPSR requires paid membership through a member institution, though some datasets are open-access.
Mendeley Data	General repository	No	N/A	Free and paid memberships are available for storing and depositing datasets with three different paid monthly plans based on total storage space	No cost associated with accessing datasets.
Roper Center for Public Opinion Research	Primarily includes public opinion survey datasets	Yes	Approximately one week	No cost associated with depositing datasets	Both members and non-members are able to access data. Non-members pay a fee associated with the data.
Zenodo	General repository	No	N/A	No cost associated with depositing datasets	No cost associated with accessing datasets

Naturally, there are different tradeoffs associated with choosing one repository over another.

Reach and impact: A number of these repositories are general in terms of disciplinary scope, whereas others cater primarily to the social sciences or the sciences. This could help shape which repository researchers select, depending on the intended audience for re-using their data. Similarly, who has the ability to access datasets in each of the repositories, and at what cost, should be considered. If open access is a priority, it might make sense to select Mendeley Data, Zenodo, or Dryad, as datasets in these repositories are freely accessible to the public. Harvard Dataverse and Figshare let scholars choose whether datasets are freely accessible or restricted. On the other end of the spectrum, ICPSR and The Roper Center generally require payment or membership to access datasets, though some ICPSR datasets are open access.

Cost to deposit: A number of the repositories require institutional or individual membership or have fees associated with depositing research data. If the cost of dataset deposit is a concern, Figshare, Harvard Dataverse, The Roper Center, and Zenodo do not charge for depositing research data, and Mendeley Data has a free membership option as well. ICPSR also does not charge for deposits, although there may be additional fees for particularly large datasets.

Data curation: Data curation services involve processes that validate data, such as ensuring that the questionnaire, codebook, and dataset of a research project are aligned. Data may also be made available in multiple file formats, such as CSV, SAS, and SPSS files. Data curation services can also serve as an additional check before data is made available to others, and they are a feature that we highly value at Ithaka S+R. Dryad, Harvard Dataverse, ICPSR, and The Roper Center all offer data curation services, whereas Figshare offers data curation through an additional subscription service, and Mendeley Data and Zenodo do not offer data curation. It is important to note that data curation can add to the length of time before a dataset becomes available in any given repository. For Dryad, the length of time to curate and deposit data is typically one day, while for The Roper Center this can take about one week, and for Harvard Dataverse it varies depending upon the complexity of the data. If the length of time before a dataset is available is not of great concern, ICPSR takes approximately four to eight weeks to curate most datasets. However, depending upon the complexity of the data, this process can take several months, so ICPSR has also developed another service, openICPSR, in which data can be deposited quickly but without data curation. If data curation is not important and speed is ideal, Figshare, Mendeley Data, and Zenodo may be good choices.
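For readers who prefer to work through these tradeoffs programmatically, the comparison can be sketched as a small filter over the table above. This is only an illustration under stated assumptions: the `Repository` class, its field names, and the `shortlist` function are hypothetical, and the yes/no values simplify the table (Figshare is marked as not offering curation because curation requires Figshare for Institutions, Mendeley Data is marked as free to deposit because a free membership exists, and ICPSR is marked as free to deposit despite possible fees for particularly large datasets).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repository:
    """One row of the comparison table, reduced to yes/no values (a simplification)."""
    name: str
    offers_curation: bool
    free_to_deposit: bool
    free_to_access: bool


# Values transcribed from the 2020 snapshot in the table above.
REPOSITORIES = [
    Repository("Dryad", True, False, True),             # paid membership plans for deposit
    Repository("Figshare", False, True, True),          # curation only via Figshare for Institutions
    Repository("Harvard Dataverse", True, True, True),
    Repository("ICPSR", True, True, False),             # possible fees for very large datasets
    Repository("Mendeley Data", False, True, True),     # free membership option; paid plans exist
    Repository("Roper Center for Public Opinion Research", True, True, False),
    Repository("Zenodo", False, True, True),
]


def shortlist(repositories, *, curation=None, free_deposit=None, free_access=None):
    """Return the names of repositories matching every criterion that is not None."""
    wanted = {
        "offers_curation": curation,
        "free_to_deposit": free_deposit,
        "free_to_access": free_access,
    }
    return [
        repo.name
        for repo in repositories
        if all(value is None or getattr(repo, field) == value
               for field, value in wanted.items())
    ]


if __name__ == "__main__":
    # Curated, free to deposit, and free to access.
    print(shortlist(REPOSITORIES, curation=True, free_deposit=True, free_access=True))
    # No curation step (faster availability) and free to access.
    print(shortlist(REPOSITORIES, curation=False, free_access=True))
```

A filter like this cannot capture curation timelines or disciplinary fit, so it is best treated as a first pass before reading each repository's current terms.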

We hope that the 2020 snapshot summarized here can help other researchers, especially those in the social sciences, as they weigh the pros and cons of each repository. Of course, these repository providers often change and adapt their services and offerings. As you consider preserving and sharing your research data, we would be happy to discuss these options with you. Please email me at nicole.betancourt@ithaka.org.

I thank Janan Shouhayib, a PhD student at The Graduate Center and an intern with the Ithaka S+R surveys and research team during the spring and summer of 2019, for her contributions to this exploratory research.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of the license, please see http://creativecommons.org/licenses/by-nc/4.0/.