Made by Hand
The Case for Manual Data Collection in an Era of Automation
When designing a research study, a key consideration is which research method, or methods, will yield the best insights. Here at Ithaka S+R we conduct applied research related to the education and cultural heritage sectors, and so we aim to collect evidence that can be put to immediate social benefit, such as improving policies and programs within institutions. Today we describe a method we regularly employ: manually collecting data from public-facing websites. The information available through public websites is often a great way to quickly characterize activity patterns in the sectors we study.
Why would we ever choose to hand-gather publicly available data in an era of machine automation and artificial intelligence? Below are three examples in which a bespoke, manual approach to data collection was the best method for achieving our research goals. In all three cases the amount of data needed was relatively small, but the analysis was fairly nuanced. In contrast, automated approaches to gathering information work best when the goal is to create a very large data set that will be analyzed along relatively few, highly discrete variables.
Tracking data service design trends
As policymakers, funders, and the public increasingly expect research data to be made findable, accessible, interoperable, and reusable, universities are looking for ways to ensure that their researchers can meet these expectations. A major strategy is to provide research data services in-house, which requires evidence on how to design service models effectively and ensure strong uptake.
In response to the need for evidence documenting trends and best practices, we developed an approach to track the size, distribution, and scoping of data services at US universities. Because multiple units may offer separate data services in parallel at the same university, and because the ways in which these services are communicated publicly vary considerably, the researchers gathered public-facing information about relevant services on university websites as a proxy for the volume and variety of service offerings. The result was a novel, step-by-step methodological process for systematically searching the web pages of a random sample of universities. For a more detailed description of the methodological approach, see the Appendix in the full report.
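To illustrate what such a hand-gathering protocol might record, the sketch below defines a structured entry for each service observed on a university website. The field names are hypothetical and do not reflect the study's actual codebook:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ServiceEntry:
    """One hand-gathered observation of a data service found on a
    university website. Fields are illustrative, not the study's
    actual coding scheme."""
    institution: str
    unit: str            # e.g., library, IT, research office
    service_title: str
    url: str
    date_checked: date = field(default_factory=date.today)

# A hypothetical observation recorded during a site visit
entry = ServiceEntry(
    institution="Example University",
    unit="Library",
    service_title="Data Management Consulting",
    url="https://example.edu/library/data-services",
)
```

Keeping one record per observed service, rather than one per institution, accommodates the fact that several campus units may offer overlapping services in parallel.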
Taking this hand-gathering approach to surfacing data service models had some unique advantages. A traditional survey would have been time-consuming and less instructive. It would have required building a contact list from scratch, and, given how dispersed data services are, it is unlikely that any single campus contact could have provided a campus-wide picture. Moreover, given the variety of job titles held by people providing such services, it would have been difficult to determine who should even be included on a survey list.
The absence of a standard location, a standard menu of research data services, and a standard method of advertising services across higher ed institutions presents a significant barrier to developing an automated scraping methodology for inventorying research data services. Services were often listed in unexpected units and under a variety of titles, and some services still listed on websites had been discontinued. To locate and categorize data services as accurately as possible, we needed to rely on contextual understanding. For example, finding services requires an awareness that some may not be explicitly titled, and accurately inventorying them often requires following links and checking dates to determine whether a listed service still exists. There are simply too many variables in the data services ecosystem to use an automated approach with confidence.
Collating community college communication patterns
In the Holistic Measures of Student Success (HMSS) project, Ithaka S+R explored strategies for defining and evaluating “success” for students at community colleges, providing us an opportunity to interrogate the metrics that community colleges traditionally use to assess student success—such as graduation rates, retention, and post-graduation employment.
To systematically review how community colleges are incorporating student success data into their communications, our researchers took a hand-gathering approach similar to that used in the research data services project. They designed a procedural method to take stock of the data made publicly available on community college websites, manually collecting and organizing it in an Excel spreadsheet, and created a coding structure to analyze patterns in how student success information is represented on college websites. To draw a sample, and to ensure that institutions were fairly represented, they randomized a list of community colleges from IPEDS data. The resulting data from 75 community colleges across 36 states provided granular insight into how federally required information about student success metrics is publicly disclosed on community college websites. As the analysis indicated, most student metrics data was difficult to locate, missing, or available only upon request. The random sampling approach also balanced the need to assemble a critical mass of information for quantitative analysis against the need to conduct a fairly nuanced evaluation of that information.
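As a sketch of the sampling step, a seeded shuffle yields a random but reproducible draw. The list of placeholder institutions below stands in for an actual IPEDS export, and the seed value is arbitrary:

```python
import random

# Placeholder roster standing in for a list of community colleges
# exported from IPEDS data.
institutions = [f"College {i}" for i in range(1, 1001)]

def draw_sample(institutions, n, seed=42):
    """Shuffle a copy of the full list with a fixed seed and take the
    first n entries, so the sample is random but reproducible."""
    rng = random.Random(seed)
    shuffled = list(institutions)
    rng.shuffle(shuffled)
    return shuffled[:n]

sample = draw_sample(institutions, 75)
```

Fixing the seed means the same sample can be regenerated later, which matters when the subsequent hand gathering and coding unfold over weeks.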
Pairing a systematic analysis of community college websites with a qualitative landscape review provided a clearer picture of how student success metrics are communicated and, by extension, what the user experience is like for those seeking that information. The researchers counted by hand the number of clicks from an institution's home page typically required to locate student metrics data, which in turn pointed to opportunities for improving communication strategies and web presence. The analysis also suggested other frequencies and patterns that could be measured through hand gathering, such as the number of broken links encountered or the number of duplicate pages found before locating the page where the data lives. Just as data services have no standard web presence or location, it is not clear in advance where these student metrics might live. A web-scraping approach in this case could easily have overlooked the many moving variables that hand gathering revealed.
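Hand-recorded measurements like click depth lend themselves to simple tallies. The sketch below uses invented numbers, not the study's data, to show how such counts might be summarized:

```python
from collections import Counter

# Hypothetical hand-recorded click depths: for each college website,
# the number of clicks from the home page needed to reach student
# metrics data (None means the data was never located).
click_depths = [2, 3, 3, None, 4, 2, None, 5, 3]

found = [d for d in click_depths if d is not None]
tally = Counter(found)                    # frequency of each click depth
not_found = click_depths.count(None)      # sites where data was missing
typical = sorted(found)[len(found) // 2]  # median click depth
```

Even a tally this simple surfaces the pattern the researchers observed: how often the data is missing outright versus merely buried several clicks deep.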
Surfacing patterns in department of corrections’ media policies
As part of our ongoing research on ensuring educational equity for those who are incarcerated, Ithaka S+R undertook an examination of media review directives in departments of corrections from all 50 states and the District of Columbia, evaluating common policies, procedures, and language across these documents.
The primary purpose of this project was to make information about these media review policies more widely accessible, and to gain a deeper understanding of how these directives shape learning outcomes in prisons. Media review directives put forward formalized standards on what materials incarcerated people can read, watch, and listen to, which in turn affects the content available in prison library collections as well as the kinds of books that can be mailed or distributed. The extent to which these policies govern higher education in prisons varies considerably, and strengthening them is especially urgent now that federal Pell Grant funding is set to resume for college students in prisons.
Ithaka S+R researchers therefore chose to review the national landscape of these policies rather than highlight individual case studies of prison censorship.
Taking a systematic approach, the researchers manually gathered media review policies and documents directly from DOC public websites and, in some instances, from third-party websites such as the Policy Clearinghouse. A major issue the project had to account for was the decentralized nature of the US prison system and the circuitous character of DOC websites. For instance, while most media review policies were publicly available on web pages, many live within other document types such as handbooks. In some cases, media review policies are spread across multiple documents, making it all the more challenging to discern the policy in its entirety.
Finding the policies therefore required hand-searching published documents to consolidate and centralize procedures and policies state by state. After manually collecting documents, the research team assembled a sample of 62 documents, each of which was then coded qualitatively for deeper analysis in NVivo. The coding process was broken into three phases in order to parse out trends and commonalities across the documents, trace language, policies, and procedures at a granular level, and sort out coded terms related to common types of censorship. For more on the coding process in NVivo, see the “Methods” section of the full report.
Hand-gathering media review directives and documents directly from DOC and related websites, in tandem with qualitative coding, helped the researchers move beyond localized analyses of prison censorship policies to a wide-scale, national review of the issue. Given that the documents are often ephemeral and that the landscape is constantly in flux, an automated approach to extracting this information would have been difficult. Collecting DOC media review directives and documents by hand, rather than through a scraping methodology, drew on contextual understanding of departments of corrections and how they operate.
The resulting findings paint a fuller picture of which media review directives limit the vendors from which materials can be purchased and to what extent states limit access to materials with sexually explicit content. As a result, Ithaka S+R was also able to make evidence-informed policy recommendations essential to advocating for educational equity in carceral settings.
At a time when automation and artificial intelligence are on the rise across the research landscape, smaller-scale hand gathering of publicly available data might come across as tedious or even inefficient from a research design perspective. The three studies highlighted here, however, point to when human curation at the front end of data collection is essential for answering applied social science research questions.
It is also important to recognize that even when we leverage automated or AI technologies to make research workflows more efficient, human labor remains. Our forthcoming project on Making AI Generative for Higher Education will involve several activities to characterize the adoption of that technology in universities. This includes tracking invocations of AI on college and university websites through a mixed methods approach: automated techniques for surfacing and randomly selecting university web pages for analysis, such as through the Google Custom Search Application Programming Interface, combined with a manual approach to refining the results. George Veletsianos, Royce Kimmons, and Fanny Bondah's pathbreaking study of ChatGPT invocations on university websites provides a great example of how this approach can work and when it makes sense to apply it.
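As a rough illustration of the automated half of such a workflow, the sketch below builds a request URL for the Google Custom Search JSON API and filters a parsed response down to .edu results. The credentials are placeholders, and the .edu filtering rule is our own assumption for the sketch, not the cited study's method:

```python
import urllib.parse

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_query_url(api_key, cse_id, query, start=1):
    """Construct a Custom Search JSON API request URL.
    api_key and cse_id are placeholders for real credentials."""
    params = {"key": api_key, "cx": cse_id, "q": query, "start": start}
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)

def extract_edu_links(response):
    """From a parsed API response, keep only results hosted on .edu
    domains as candidate university pages for manual review."""
    items = response.get("items", [])
    return [item["link"] for item in items
            if urllib.parse.urlparse(item["link"]).hostname.endswith(".edu")]

url = build_query_url("YOUR_KEY", "YOUR_CSE_ID", "ChatGPT site:.edu")
# Fetching would look like:
#   import json, urllib.request
#   response = json.load(urllib.request.urlopen(url))
```

The surfaced links would then feed the manual refinement step the project describes, with a researcher confirming that each page genuinely discusses AI adoption.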
As we continue to experiment in this area, we are eager to connect with others with similar interests. If you have any questions about the research methods discussed in this blog post, or if you are simply interested in methodological questions about hand-collected versus automated data collection, we look forward to hearing from you.