Encouraging scholars to share research data with one another promises to increase research efficiency, reproducibility, and innovation. In a recent issue brief, Danielle Cooper and I argued for a new conceptual framework for understanding and supporting research data sharing: data communities. Data communities are formal or informal groups of scholars who share a certain type of data with each other, regardless of disciplinary boundaries, and those who wish to support data sharing should work to identify and support emergent data communities – groups of scholars who are on their way to developing an active, sustainable data sharing culture.

But what is it really like to build a data community from the ground up? To find out, I decided to interview experts who are at the forefront of growing emergent data communities in a variety of research areas. For the first of these interviews, I’m delighted to introduce Dr. Vance Lemmon. Dr. Lemmon holds the Walter G. Ross Distinguished Chair in Developmental Neuroscience at the Miami Project to Cure Paralysis and is a Professor of Neurological Surgery at the University of Miami. His lab works on drugs for axon (nerve fiber) regeneration following injuries to the spinal cord. He has also been a leader of recent efforts to facilitate data sharing among scientists working in the multidisciplinary field of spinal cord injury (SCI) research.

Photo by Rob Camarena

In what follows, Dr. Lemmon raises a number of issues relevant to many data sharing communities, including the challenges of heterogeneous and large file types, the potential for progress through developing standards and adopting technologies like electronic lab notebooks, and the important role of patient advocacy groups.

Perhaps you could start by briefly describing what research data looks like for a typical SCI lab group.

First, a disclaimer. I conceived of the MIASCI/Regenbase Ontology projects and am on the steering committee for the ODC-SCI project. Any wonderful things I say about those projects should be evaluated with those facts in mind. I also think ontologies are an amazing way to uncover hidden knowledge. I admit there are relatively few concrete examples of this.

I don’t know what a “typical SCI lab” is as they are very diverse and the answer might be different over time (2000 vs. 2019), even for a given lab. Setting that complication aside, a typical SCI lab would have, at least, animal experiments, so we can focus on that. 

SCI experiments typically involve the preparation of some potential therapeutic agent. How these agents are prepared can be documented in paper notebooks, electronic notebooks, and laboratory information management systems. Supporting data (text files, spreadsheets, images, data files from various machines) are often dispersed across many computers, servers and storage systems, most of which the P.I. is only vaguely aware of.

Animals are used for the injury experiments. Most often they are mice and rats but can include fish, amphibians, dogs, pigs, and non-human primates. Information about acquisition, breeding, and husbandry, along with many surgical conditions, is most commonly kept on paper records so that it is readily accessible to veterinary staff, to local, national, and international regulatory groups, and to the research team. In addition to paper records and notebooks, it is now not uncommon for this kind of information to be captured on computers, cell phones, and tablets using simple note-taking tools or custom lab animal data apps.

SCI experiments include many kinds of outcome measures that can usually be lumped into behavior, anatomy, and physiology but now often include biochemistry and molecular biology, including next generation sequencing projects. Most labs use custom devices to do standardized behavioral tests such as foot falls on horizontal ladders, or hyperalgesia. Many tests use video to capture the behavior. In most labs, squads of undergraduates watch the slow motion videos stored on hard drives or storage systems to analyze kinematics, count mistakes, measure time intervals, etc. This data is entered into spreadsheets for later statistical analysis, but more and more frequently these outcomes are measured and stored using customized semi-automated programs with text or csv files as outputs. There are often regulatory restrictions on sharing the videos themselves more widely.

Anatomical experiments are diverse but often involve counting the number of surviving cells, the size of lesions, the proteins present in and around the injury and the number of nerve fibers present above and below the injury site. The data involves images and manual or automated tracing software, and quantification of various outcomes using the acquired images. These open source or commercial software packages often give csv, Excel and AVI output files that can be shared. Recently, with development of fluorescent light sheet microscopes that can image very large volumes at high resolution, image stacks have become enormous, from 5 GB to 1 TB. This kind of data is very hard to “share.”

Many biomedical scientists, including those interested in SCI, Traumatic Brain Injury (TBI), Multiple Sclerosis (MS) and stroke are using next generation sequencing approaches to study gene expression, transcription factor binding sites, non-coding RNAs, chromatin accessibility, and epigenetic modifications. All of these data sets are giant – primary data on the order of magnitude of terabytes – and will get worse with single cell sequencing. Primary data sets are stored on hard drives and cloud storage systems and due to their size are hard to share. Processed data are typically shared by posting at the Gene Expression Omnibus or other data repositories.

In 2017, you and colleagues wrote that the SCI research community is in a “state of readiness” to adopt more widespread data sharing practices. Why is it a good time for SCI researchers in particular to start sharing their data more effectively?

Maybe this was propaganda to spread the spirit of data sharing. But maybe not. The spinal cord injury research community in the U.S. and Europe is not enormous and it is well connected to SCI researchers in Asia, especially China and Japan. Meetings that attract the leading labs have, at most, a couple hundred participants who develop strong relationships over the years. In the early 2000s, three SCI centers at U.C. Irvine, the Ohio State University and the University of Miami undertook a series of replication studies, with the support of NIH. Perhaps the most important conclusion from this exercise was that SCI papers do an inadequate job of documenting the methods used in different projects. The SCI community really paid attention to this. Improving the way we share data and metadata could help mitigate this problem and perhaps improve the robustness of experimental findings.

Another influence was patients with SCI, their patient advocates, and SCI foundations. They became concerned that progress was being limited by inadequate collaboration and data sharing and in the early 2000s began to press SCI researchers to do a better job of this. SCI researchers often meet with individuals with SCI and their families. Every day, in our building, I see people who have experienced spinal cord injuries. I learn about how the injuries happened and what their lives are like. This kind of interaction is very impactful. Trust me. SCI researchers listen. How can we not?

On the flip side, what challenges or barriers to widespread data sharing are unique to SCI research?

As far as I can tell, the barriers to data and metadata documentation and sharing are not unique to SCI research. They are common and well described. Perhaps one thing that separates SCI from some other domains (but not TBI, stroke or MS) is that a single “experiment” encompasses so many different procedures and processes: animals, surgeries (often multiple), treatments, and complex and diverse outcome measures, including behavior, anatomy, biochemistry and molecular biology.

I have a hobby of photographing devices used in SCI experiments when I visit labs around the world. While there are some standardized machines, I have noticed that almost everyone comes up with a custom device to hold the animals during the surgeries. There is no question in my mind that this introduces variability between labs. I have no idea how to share that kind of information about these small devices, but it is probably my lack of imagination. Inspired by others, my lab members have recently demanded a 3D printer! Perhaps in the future people will routinely share the 3D printer files.

You led the development of MIASCI – a standard format for the “minimum information about a spinal cord experiment” – through consultation with a large number of experts. More generally, what kinds of reactions do you get from other SCI researchers when you talk to them about sharing their data?

First the good news: people think it is a great idea. Now the bad news: almost no one complies with the standard. The paper was published in 2014. It has only been cited 34 times.  For comparison, MIAME, the reporting standard on microarrays (laboratory tests for gene expression), was published in 2001 and has over 2700 citations. MIAME has fewer than 30 elements. MIASCI has over 450 elements, many of which have to be used several times and might point to something like MIAME as a single element.

What are the most important supports SCI researchers need in order to cultivate a thriving data community?

I think SCI scientists are like other scientists. Sharing data takes resources. We understand the importance of collecting metadata about our experiments, even if it is just so we can do a good job of writing our papers. Publishers, especially the Nature publishing group, have really raised the bar regarding what they expect authors to submit. This is fantastic. Incentives to encourage sharing data and metadata are helpful. But for it to be done efficiently, scientists need to be proactive and know what metadata they need from the very beginning of an experiment. Reporting standards like MIASCI can inform those decisions. Universities could help, especially in this era of “big data,” by providing great electronic notebooks to ease data and metadata collection, preservation and sharing. 

The SCI funders (NIH, EU, and foundations in North America and Europe) and SCI patient advocates have really been stepping up. They are also interested in seeing all the findings from studies they have sponsored and thus have supported several meetings devoted to data sharing as well as different projects. The scientists interested in working on this problem have found a very supportive environment and this, I think, is a real game changer.  

Looking at the organizational level, there are a number of separate initiatives relating to SCI data sharing, including MIASCI Online, RegenBase, the Regenbase Ontology, the Spinal Cord Injury Ontology, PSINK, ODC-SCI, SCI Common Data Elements and CAMARADES. (This is the case for a lot of emergent data communities, as new projects grow from the ground up.) Is this dispersion a problem in your eyes? Do you have any expectations of future consolidation or coordination?

Is this a problem? Not from my perspective. The more, the merrier. In fact, there are not so many players. While the point persons have changed, there is a core group of perhaps 40 or 50 people in SCI who have worked together over the past 10 years to push the idea of data sharing forward. They have always had the support of relevant organizations that have facilitated workshops and mini-symposia at small and large neurotrauma and neuroscience meetings. As I mentioned above, I think funding agencies and patient advocates will continue to motivate different groups at least to talk and, at best, to collaborate and consolidate.

One unexpected barrier is governments discouraging or preventing the use of funds across borders. For example, our lab collaborates with SCI researchers internationally and some kinds of work require pre-approval. This makes me nervous and extra cautious about how our collaborations are actually done. In addition, export control laws, enforced by the U.S. Department of Commerce, restrict data sharing with certain individuals and institutions. The only way to stay in compliance is to ask our university’s export control officer to assist us.

I do think we could do a better job of communication between the data scientists in Europe and North America working on this problem. As far as I can tell, there are no data scientists in Asia working on the SCI data sharing problem. Given the large number of papers being published in China on SCI, this needs to be fixed. I have spent significant time in SCI labs in China, and the people doing the work, graduate students and postdocs, face language barriers in using data sharing tools. If there were Chinese-language annotation tools, it would greatly facilitate contributing lab-based data to national or international repositories.

What would you say has been the greatest success of the emergent SCI data community so far?

I think the recognition by such a large fraction of the SCI community that data sharing is the smart/right thing to do is the greatest success. I think Adam Ferguson (UCSF) and Karim Fouad’s (Univ of Alberta) ODC-SCI is the first significant implementation of this idea. Not only are they trying to capture published data but also the huge amount of “dark data” that is never published.  

What’s on the horizon for data sharing in SCI research?

With new funding from three international SCI foundations, the ODC-SCI project is well positioned to capture data at the level of individual animals from many SCI labs. This will enable researchers to avoid experimental repetitions and mistakes, clinicians to capture the bigger picture of preclinical findings, and data scientists and biostatisticians to ask new and interesting questions that will be powered by data from experiments involving thousands of animals. I strongly believe that such efforts will speed the development of novel therapies while reducing waste of precious resources, especially animals.

Interested in developing or improving research data services on your campus? Ithaka S+R is embarking on parallel collaborative research projects on Supporting Teaching with Data in the Social Sciences and Supporting Big Data Research. Participants will work alongside Ithaka S+R to conduct a deep dive into their faculty’s needs and craft evidence-based recommendations. To find out more about having your library participate as a research site for these or future projects, please email danielle.cooper@ithaka.org.