People are becoming increasingly uncomfortable with sharing personal information (such as email addresses, names of their spouses, and demographics such as race-ethnicity and gender) with their employers and with other organizations more broadly. Many people are concerned about their privacy as more and more data are collected in a wide range of contexts. Some are apprehensive about research and associated data collection in general due to its history of racism, sexism, and other forms of discrimination and mistreatment in a variety of fields. Others simply don’t want to answer questions about their identities even if they do not have a specific concern. The identity of the individual or group collecting these data, for example, an independent researcher or an employer, can also affect these concerns.

All of these factors are essential for researchers and other professionals to consider when designing their research protocols and instruments, and they are of particular concern to us at Ithaka S+R when deciding what demographic questions to include in or exclude from our survey questionnaires. In this post, I unpack some of the most important tradeoffs we weigh when making these determinations. While not all survey takers share the anxieties outlined above, by including only essential demographic questions, such as those hypothesized to impact results or those that are central to the purpose of the survey, we can help ease participants’ concerns, increase response rates, and ultimately improve the usefulness of the data collected.

Scope of study

One of the first, and most important, considerations in designing the demographics section of a survey is the scope of the study. For example, are individual identities central to the purpose of the study, as was the case in several of our recent publications interrogating employee demographics at ARL libraries and art museums? In such cases, inclusion of a wide variety of demographic questions may be warranted. Or does the researcher have particular hypotheses related to one or more demographic questions (e.g. that gender differences are expected)? Here, it may make sense to include only those demographics. When demographics are not central to the purpose of the study and there are no hypotheses about their impact, it may be beneficial to include only a small number of items, if any, rather than a comprehensive set of demographic questions.

Further, as a rule of thumb, we only report results from subgroups (that is, subsets of respondents) if there are at least 10 participants in a given subgroup, both to protect participant privacy and to maintain statistical power. When we are dealing with larger datasets, we often raise this threshold to 30 participants. When this threshold is not met, we either combine response categories, if it makes sense to do so, or exclude participants from subgroup analyses. For example, we may roll up states into regions to permit some level of analysis by geography. But we often have to exclude transgender as a separate category for analysis (while still reporting the total number of respondents in our demographics tables) if there are fewer than 10 respondents who identify with the group. The next post in this series will further explore strategies for analyzing and reporting demographic data.
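
The reporting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of our actual analysis pipeline: the function name reportable_subgroups, the state-to-region mapping, and the sample responses are all hypothetical. Only the threshold of 10 comes from the text.

```python
from collections import Counter

MIN_SUBGROUP_SIZE = 10  # threshold from the text; often raised to 30 for larger datasets


def reportable_subgroups(values, min_size=MIN_SUBGROUP_SIZE, rollup=None):
    """Count respondents per subgroup and split the subgroups by size.

    values  -- one category label per respondent
    rollup  -- optional dict mapping a fine category to a broader one
               (hypothetical helper for combining response categories)
    Returns (reportable, suppressed): a dict of counts for subgroups that
    meet the threshold, and a sorted list of subgroup labels that do not.
    """
    rollup = rollup or {}
    counts = Counter(rollup.get(v, v) for v in values)
    reportable = {group: n for group, n in counts.items() if n >= min_size}
    suppressed = sorted(group for group, n in counts.items() if n < min_size)
    return reportable, suppressed


# Hypothetical responses to a "state" question.
states = ["NY"] * 7 + ["NJ"] * 5 + ["CA"] * 12 + ["OR"] * 4

# Hypothetical roll-up of states into regions.
STATE_TO_REGION = {"NY": "Northeast", "NJ": "Northeast", "CA": "West", "OR": "West"}

by_state, hidden_states = reportable_subgroups(states)
by_region, hidden_regions = reportable_subgroups(states, rollup=STATE_TO_REGION)
```

In this made-up sample, only one state clears the threshold on its own, while both regions do after the roll-up. The suppressed labels would still be counted in the overall demographics table; they are simply left out of subgroup comparisons.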

Anonymity, confidentiality, and risks of re-identification

In addition to individual concerns about privacy, survey anonymity or confidentiality can influence which questions participants answer. We consider our surveys anonymous when names, emails, and other personally identifiable information (e.g. student ID) are not tied to survey responses. Surveys are deemed confidential when we collect this information but do not share it with anyone outside of the immediate project team.

If a survey is anonymous or confidential, it is beneficial, and in fact required by institutional review boards, to inform participants of this at the outset, whether in introductory text or in an informed consent document. Such disclosure can help participants feel more secure about sharing their identities in the survey.

However, even when a survey is classified as anonymous, there is never a 100 percent guarantee that participants cannot be re-identified if someone tries hard enough, especially when participants have multiple identities that are underrepresented in the sample of respondents. Sharing at the outset how the data will be stored, protected, and shared can help alleviate these concerns. The design of demographic questions can also aid in this endeavor. For example, rather than asking participants to write in their age, which can potentially be used to re-identify them, especially in conjunction with other demographic variables, we generally ask participants to select their age from a set of predetermined ranges (e.g. 22-34, 35-44, etc.). While this makes analysis slightly trickier, the benefit of making the question more comfortable for participants to answer outweighs this cost.
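
When exact ages have already been collected (for instance, in an older dataset), the same idea can be applied after the fact by recoding them into ranges before analysis or sharing. The following Python sketch shows one way to do that. The first two ranges (22-34 and 35-44) come from the example above; the remaining ranges, the "Under 22" label, and the function name age_bracket are assumptions made for illustration.

```python
from bisect import bisect_right

# Lower bound of each bracket, in ascending order.
# 22-34 and 35-44 are from the text; the later brackets are hypothetical.
LOWER_BOUNDS = [22, 35, 45, 55, 65]
LABELS = ["22-34", "35-44", "45-54", "55-64", "65+"]


def age_bracket(age):
    """Map an exact age to its range label."""
    # bisect_right returns how many lower bounds are <= age,
    # so subtracting 1 gives the index of the matching bracket.
    idx = bisect_right(LOWER_BOUNDS, age) - 1
    if idx < 0:
        return "Under 22"  # hypothetical catch-all below the first bracket
    return LABELS[idx]


# Hypothetical write-in ages recoded into ranges.
ages = [22, 34, 35, 44, 58, 70]
brackets = [age_bracket(a) for a in ages]
```

Recoding discards detail that cannot be recovered later, which is the analysis cost mentioned above, so the ranges are worth choosing with the planned comparisons in mind.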

Data sharing

Finally, consider how the data will be shared both inside and outside the organization collecting it (e.g. by depositing in a data repository), if applicable. In preparing the data of our most recent Library Survey in 2020 for deposit with ICPSR, my colleagues and I reviewed all demographic questions for potential overlap that could re-identify participants. In academic libraries and other fields with few employees who have particular identities, such as employees of color, including information on race-ethnicity and gender alongside institutional characteristics potentially violates the anonymity and confidentiality of a survey even if names or emails are not attached to the results. Thus, we often exclude individual characteristics such as race-ethnicity and gender from our dataset deposits. We also frequently combine demographic response options, such as our youngest age groups, when few participants select a given response option. These strategies allow us to ask potentially useful demographic questions such as race-ethnicity and gender while maintaining the anonymity or confidentiality of our surveys more broadly.
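
One way to review demographic variables for overlap is to count how many respondents share each combination of values and flag the combinations held by only a few people. The Python sketch below illustrates that idea, which is similar in spirit to a k-anonymity check. It is not a description of the review we performed for the ICPSR deposit: the function name small_cells, the column names, the sample records, and the reuse of 10 as the cutoff are all assumptions.

```python
from collections import Counter


def small_cells(records, keys, k=10):
    """Return the combinations of `keys` shared by fewer than k respondents.

    records -- list of dicts, one per respondent
    keys    -- the demographic variables to cross-tabulate
    k       -- minimum acceptable group size (assumed here to be 10)
    """
    combos = Counter(tuple(r[key] for key in keys) for r in records)
    return {combo: n for combo, n in combos.items() if n < k}


# Hypothetical respondent records.
records = (
    [{"region": "Northeast", "gender": "Woman", "age_range": "35-44"}] * 12
    + [{"region": "Northeast", "gender": "Man", "age_range": "35-44"}] * 3
    + [{"region": "West", "gender": "Woman", "age_range": "45-54"}] * 11
)

# With all three variables, one combination describes only three people.
risky = small_cells(records, keys=("region", "gender", "age_range"))

# Leaving gender out of the shared file removes that small cell.
risky_without_gender = small_cells(records, keys=("region", "age_range"))
```

A check like this only covers the variables it is given. Free-text answers or rare institutional details can still narrow a respondent down, so it complements, rather than replaces, a manual review.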

Alternatives to collecting demographic information from respondents

One alternative to asking a variety of demographic questions in a survey, when it is deemed necessary to collect these data, is to pull in available information from another platform, such as a student or human resources information system. Survey platforms like Qualtrics allow researchers to upload contact lists that include professional and personal information on all invitees, such as academic department, years with the college or university, race-ethnicity, gender, and more. This approach allows participants to take the survey without having to repeatedly disclose their identities and means that researchers do not have to write and revise their demographic questions. However, it does limit researchers to the available data and to the categories that were used when those data were originally collected. Many HR systems, for example, categorize gender as a binary. Some only allow people to select one race category, so that those who identify with more than one race are grouped together in a “two or more races” category. If this strategy is used, you should inform participants that these data will be connected to their responses, especially if the survey is advertised as anonymous or confidential.

Questions to consider

When deciding on which demographic questions to include or exclude, consider the following questions:

  • What is the purpose of including this demographic question? What hypotheses do I have about how it might interact with other variables in the survey and/or how will I use it to understand the representativeness of my sample of respondents against the broader population?
  • Is my survey anonymous or confidential, and have I adequately disclosed this status to respondents? What can I share with respondents about how this status will be maintained if there are few respondents in a given subgroup?
  • Do I plan on sharing these data for other researchers to use? If so, how will I continue to maintain anonymity or confidentiality while also addressing issues of possible re-identification?
  • What can I disclose up front to participants to ease any discomfort they have about sharing their personal information? Are there other ways I can link this information to responses without asking participants directly? If so, am I comfortable with the data provided from other sources?

Reflecting on the answers to these questions when designing a survey, rather than just relying on a standard set of demographics across all projects, can help balance the comfort of your participants with the validity and usefulness of the data collected.

____________________________________________________________________

This blog post is the second in a series on our demographic questions related to individual identity. A previous post in this series included recommendations for writing and revising demographic questions effectively and inclusively. A subsequent and final post in the series will focus on analyzing demographic variables in an inclusive way while maintaining scientific rigor. We welcome your comments, questions, and reactions to these strategies and our sample questions in the comment box below or via email at jennifer.frederick@ithaka.org.