Concerned About Bots Taking Over Your Survey?
Reflections on Maintaining Data Integrity
Last week, Melissa Simone, a researcher from the University of Minnesota, shared an honest and frightening account of bots infiltrating the data gathered in an online research study. Within 12 hours of launching a survey, Simone found more than 350 bot-generated responses in the resulting dataset. Identifying, screening for, and cleaning these data took hundreds of hours, she reported via Twitter.
My online #researchstudy was recently infiltrated by bots. I haven't shared this story publicly because I felt a bit like it was my fault. I'm putting my pride aside because I think #dataintegrity is and will be a growing issue in survey data and is not discussed enough (1/n)
— Melissa Simone, PhD (She/Her) (@m_simonephd) September 17, 2019
Simone goes on to share a number of recommendations for keeping bots out of surveys, including logic/inattention checks, open-ended responses, CAPTCHAs, and repeat questions. In one tweet, she stresses the importance of restricting access to the survey to the population under study, stating, “NO PUBLIC LINKS. EVER. DON’T DO IT.”
We very much agree with this last recommendation to restrict access to a predefined population, and we regularly implement it through the Ithaka S+R surveys program, including our national surveys of faculty members and library directors as well as our local surveys of students and faculty members. Independently and in close coordination with partner institutions, we carefully define the parameters of survey populations and generate a unique link for each survey invitee, which controls not only who can take a survey but also who cannot. This approach helps prevent bots and other compromises to data integrity, and it has a variety of other benefits, explored in this post, that can help maximize survey engagement.
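To make the idea of unique links concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how any particular survey platform works: the base URL, the function names, and the in-memory token table are all hypothetical. Each invitee gets a random, unguessable token, and a response is admitted only if its token is known and has not already been used.

```python
import secrets

# Hypothetical base URL; a real survey platform would supply its own.
BASE_URL = "https://survey.example.org/s"


def issue_links(invitee_emails):
    """Return a dict mapping a random token to each invitee's email."""
    tokens = {}
    for email in invitee_emails:
        token = secrets.token_urlsafe(16)  # unguessable, URL-safe string
        tokens[token] = email
    return tokens


def link_for(token):
    """Build the personal survey URL for one token."""
    return f"{BASE_URL}?t={token}"


def may_respond(token, tokens, used):
    """Admit only known tokens that have not already submitted."""
    return token in tokens and token not in used
```

Because there is no public entry point, a bot that reaches the survey without a valid token is turned away before it can contribute a response.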
Screening respondents and tracking response rates
Distributing unique survey links to a predefined population can simplify the process of screening participants at the outset, and may even remove the need for screening altogether. By using a restricted sample of invitees rather than a public survey link, you can not only prevent bots from falsely contributing to your survey but also track genuine responses in a more sophisticated manner. For example, if you circulate targeted messaging to different groups within your sample, you can monitor the response rate of each group and then tailor future messaging to the groups whose response rates are lagging.
When working with a predefined sample of invitees, it is also possible to communicate with them about the survey in a more customized way. Many survey platforms provide an option to personalize messaging by addressing recipients of your survey invitations by their title, first name, or full name, and these forms of personalization have often been associated with higher response rates. You might also consider circulating targeted messaging to different groups within your sample, such as those located on different college campuses. Additionally, unique links make it possible to see which invitees have not yet started or completed the survey, so that follow-up messages go only to those who still need them.
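The group-level tracking and follow-up described above can be sketched as follows. This is a hedged illustration only: the invitee records, the "campus" grouping field, and the three status values are assumptions made for the example, and a survey platform would normally provide this information through its own reports or exports.

```python
from collections import Counter

# Hypothetical invitee records; "status" is assumed to be one of
# "not_started", "in_progress", or "complete".
invitees = [
    {"email": "a@example.edu", "campus": "North", "status": "complete"},
    {"email": "b@example.edu", "campus": "North", "status": "not_started"},
    {"email": "c@example.edu", "campus": "South", "status": "in_progress"},
    {"email": "d@example.edu", "campus": "South", "status": "complete"},
    {"email": "e@example.edu", "campus": "South", "status": "complete"},
]


def response_rates(invitees, group_key):
    """Share of invitees in each group who completed the survey."""
    invited = Counter(p[group_key] for p in invitees)
    completed = Counter(
        p[group_key] for p in invitees if p["status"] == "complete"
    )
    return {group: completed[group] / invited[group] for group in invited}


def needs_reminder(invitees):
    """Emails of invitees who have not yet completed the survey."""
    return [p["email"] for p in invitees if p["status"] != "complete"]
```

With a public link neither calculation is possible, because there is no list of invitees to serve as the denominator and no way to tell who has yet to respond.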
Further, some survey platforms allow for the inclusion of embedded data fields (additional variables that are tied to individual contact information), which support more sophisticated analysis of survey responses, such as comparing the demographic makeup of the resulting sample against that of the full population.
Using embedded data fields can also help to cut down on the number of questions included in a survey, provided these data (e.g. demographic variables such as age, gender, or title) are readily accessible, and may also yield more valid data than self-reporting. Embedded variables can also drive survey logic within an instrument, controlling which questions are displayed to and hidden from participants. For example, if a set of questions should be shown only to a subset of participants, you could use an embedded data field to control who receives it, rather than adding one or more survey questions to obtain the same information.
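As a rough sketch of how an embedded data field can drive display logic, consider the following Python example. The question bank, the "role" field, and the `show_if` convention are hypothetical and chosen only for illustration; survey platforms typically expose this kind of logic through their own interfaces.

```python
# Hypothetical question bank: each question may name a condition on the
# invitee's embedded data that must hold for the question to be shown.
QUESTIONS = [
    {"id": "q1", "text": "How often do you use the library?",
     "show_if": None},
    {"id": "q2", "text": "Which courses did you teach this term?",
     "show_if": ("role", "faculty")},
    {"id": "q3", "text": "What is your major?",
     "show_if": ("role", "student")},
]


def questions_for(embedded):
    """Return the ids of the questions this invitee should see."""
    shown = []
    for question in QUESTIONS:
        condition = question["show_if"]
        # Unconditional questions are always shown; conditional ones
        # only when the embedded field matches the required value.
        if condition is None or embedded.get(condition[0]) == condition[1]:
            shown.append(question["id"])
    return shown
```

Here the invitee's role comes from the contact list rather than from a screening question, so no respondent has to answer "Are you a faculty member or a student?" before reaching the relevant items.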
Maintaining anonymity or confidentiality
Lastly, when distributing unique survey links, concerns about anonymity and confidentiality may arise, given that unique links are directly tied to respondent contact information. However, unique links can be employed without compromising respondents' identities, as a number of survey platforms can deploy these links while simultaneously anonymizing survey responses by making identifying information (e.g. name, email address, or IP address) unavailable to survey administrators. The same is true when using embedded data fields, which can be tied to individual responses while access to identifying information is removed. Regardless of whether a survey is fielded via unique or public links, respondents should be made fully aware of whether their responses will be kept anonymous or confidential, and, when responses are anonymous, survey administrators should take great care to ensure that respondents are not re-identified.
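For readers who want to see the principle in miniature, the sketch below separates identifying fields from a response record while keeping embedded data attached. The field names are assumptions for the example, and this is not how any specific platform implements anonymization.

```python
# Assumed names for the identifying fields; adjust to the actual export.
IDENTIFYING_FIELDS = {"name", "email", "ip_address"}


def anonymize(record):
    """Return a copy of the record without the identifying fields."""
    return {
        key: value
        for key, value in record.items()
        if key not in IDENTIFYING_FIELDS
    }
```

Removing direct identifiers is only a first step. A combination of the remaining fields (for instance, a rare title within a small department) can still point to one person, which is the re-identification risk noted above.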
Distributing surveys to a predefined population via unique links provides a level of security that surveys fielded through a single, public-facing, open-access link lack. Through this approach, survey administrators gain the ability to monitor and increase response rates in more targeted ways, embed data fields to perform additional analysis of responses, and engage with participants more personally by customizing messaging to recipients. This practice is one of many that we at Ithaka S+R regularly use and endorse in an effort not only to ensure data integrity and security, but also to increase survey engagement and the representativeness and generalizability of the resulting datasets.