Data Repository Platforms: A Primer
Last updated on March 10, 2021
While scholars generally believe in the value of sharing and preserving research datasets, many do not believe it’s worth their time to do so. And when they do invest time in data sharing and preservation, they tend to prefer doing so independently and self-reliantly. We have documented these issues through our long-standing national faculty survey, and we have also contended with them in our own work as social science researchers conducting large-scale survey studies.
Data sharing is valuable for many reasons. It permits others to replicate analyses and results, spurs additional research with pre-existing datasets, improves methods of data collection through the scrutiny of others, and encourages alternative perspectives, which can promote a diversity of analyses and conclusions. Sharing research data also contributes to societal knowledge and lets other researchers build on existing data rather than sink resources into duplicating data collection efforts. During the COVID-19 pandemic, when faculty are encountering challenges in conducting research with newly generated data, leveraging data that has already been collected and analyzed can be particularly useful. Many scholars weigh these benefits against the aforementioned challenges, along with funder mandates, when determining whether and how to deposit their data.
Since there is a robust landscape of research data sharing spaces, we decided to conduct exploratory, high-level research on a number of data repositories, primarily to inform our own data deposit protocols. We regularly deposit data from the US Faculty Survey, the Library Director Survey, and several other research projects with ICPSR. Recognizing that our research on the characteristics of data repositories may be useful to other researchers, today we are publishing a summary of our findings.
Below, seven repositories are compared side by side in tabular format. We have highlighted the factors that are key to informing decision-making: disciplinary scope, typical timelines for processing datasets, associated costs, and services offered (such as data curation).
| Repository name | Disciplinary scope | Offers data curation? | Length of time to curate data | Cost of data deposit | Cost of accessing data deposits |
| --- | --- | --- | --- | --- | --- |
| Dryad | General repository with a focus on scientific and medical datasets | Yes | Approximately one day | There are a variety of paid membership plans available to institutions and publishers for depositing datasets. Pricing is based on factors such as the level of research grant funding | No cost associated with accessing datasets |
| Figshare | General repository | No, but available for Figshare for Institutions | N/A | No cost associated with depositing datasets | No cost associated with accessing datasets |
| Harvard Dataverse | General repository | Yes | Free consultation and assessment takes between 1-3 hours, but actual length of time for curation varies depending upon the complexity of the data | No cost associated with depositing datasets | No cost associated with accessing datasets |
| ICPSR | General repository with a focus on social science datasets | Yes | Once assigned to a curator, the curation process for most studies takes anywhere from 4-8 weeks, but can take up to several months depending on the complexity of the data and the level of curation needed | No cost associated with depositing datasets; there may be additional fees for particularly large datasets | Access to ICPSR requires paid membership through a member institution, though some datasets are open-access |
| Mendeley Data | General repository | No | N/A | Free and paid memberships are available for storing and depositing datasets with three different paid monthly plans based on total storage space | No cost associated with accessing datasets |
| Roper Center for Public Opinion Research | Primarily includes public opinion survey datasets | Yes | Approximately one week | No cost associated with depositing datasets | Both members and non-members are able to access data. Non-members pay a fee associated with the data |
| Zenodo | General repository | No | N/A | No cost associated with depositing datasets | No cost associated with accessing datasets |
Naturally, there are different tradeoffs associated with choosing one repository over another.
Reach and impact: A number of these repositories are general in disciplinary scope, while others cater primarily to the social sciences or to the sciences. This could shape which repository researchers select, depending on the intended audience for re-using their data. Similarly, who can access datasets in each repository, and at what cost, should be considered. If open access is a priority, it might make sense to select Mendeley Data, Zenodo, or Dryad, as datasets in these repositories are freely accessible to the public. Harvard Dataverse and Figshare let scholars choose whether datasets are freely accessible or restricted. On the other end of the spectrum, ICPSR and The Roper Center generally require payment or membership to access datasets, though some ICPSR datasets are open access.
Cost to deposit: Some of the repositories require institutional or individual membership or have fees associated with depositing research data. If the cost of deposit is a concern, Figshare, Harvard Dataverse, The Roper Center, and Zenodo do not charge for depositing research data. ICPSR likewise does not charge for deposit, though there may be additional fees for particularly large datasets, and Mendeley Data has a free membership option as well.
Data curation: Data curation services involve processes that validate data, such as ensuring that the questionnaire, codebook, and dataset of a research project are aligned. Data may also be made available in multiple file formats, such as CSV, SAS, and SPSS files. Data curation services can also serve as an additional check before data is made available to others, and they are a feature that we highly value at Ithaka S+R. Dryad, Harvard Dataverse, ICPSR, and The Roper Center all offer data curation services, whereas Figshare offers data curation through an additional subscription service, and Mendeley Data and Zenodo do not offer data curation. It is important to note that data curation can add to the length of time before a dataset becomes available in any given repository. For Dryad, the length of time to curate and deposit data is typically one day, while for The Roper Center this can take about one week, and for Harvard Dataverse it varies depending upon the complexity of the data. If the length of time before a dataset is available is not of great concern, ICPSR takes approximately four to eight weeks to curate most datasets. However, depending upon the complexity of the data, this process can take several months, so ICPSR has also developed another service, openICPSR, in which data can be deposited quickly but without data curation. If data curation is not important and speed is ideal, Figshare and Mendeley Data may be good choices.
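For readers who prefer to work through these tradeoffs programmatically, the comparison can be sketched as a small filter. This is only an illustrative sketch: the `Repository` class, its field names, and the `shortlist` function are hypothetical, and the yes/no values are our simplification of the table above (for example, ICPSR is marked as free to deposit even though particularly large datasets may incur fees, and Figshare is marked as not offering curation because curation requires Figshare for Institutions).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repository:
    name: str
    curation: bool      # curation is part of the standard offering (simplified)
    free_deposit: bool  # a no-cost deposit option exists (simplified)
    free_access: bool   # datasets can generally be accessed at no cost (simplified)


# Values condensed from the comparison table; nuances are noted in comments.
REPOSITORIES = [
    Repository("Dryad", True, False, True),             # paid membership plans for deposit
    Repository("Figshare", False, True, True),          # curation only via Figshare for Institutions
    Repository("Harvard Dataverse", True, True, True),
    Repository("ICPSR", True, True, False),             # possible fees for very large datasets
    Repository("Mendeley Data", False, True, True),     # free membership tier available
    Repository("Roper Center", True, True, False),      # non-members pay a fee to access data
    Repository("Zenodo", False, True, True),
]


def shortlist(repos, *, need_curation=False, need_free_deposit=False,
              need_free_access=False):
    """Return the names of repositories that satisfy every requested criterion."""
    return [
        r.name
        for r in repos
        if (r.curation or not need_curation)
        and (r.free_deposit or not need_free_deposit)
        and (r.free_access or not need_free_access)
    ]


if __name__ == "__main__":
    # Curation, free deposit, and free access all required.
    print(shortlist(REPOSITORIES, need_curation=True,
                    need_free_deposit=True, need_free_access=True))
    # prints ['Harvard Dataverse']
```

A yes/no filter like this cannot capture curation timelines or pricing tiers, so it is best treated as a first pass before consulting each repository's current terms.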
We hope that the 2020 snapshot summarized here can help other researchers, especially those in the social sciences, as they weigh the pros and cons of each repository. Of course, these repository providers often change and adapt their services and offerings. As you consider preserving and sharing your research data, we would be happy to discuss these options with you. Please email me at email@example.com.
I thank Janan Shouhayib, a PhD student at The Graduate Center and an intern with the Ithaka S+R surveys and research team over the spring and summer of 2019, for her contributions to this exploratory research.