On Seams, Seamlessness, and Methodology

Roger C. Schonfeld

Earlier this month, I encountered a thought-provoking talk by Tim Sherratt making the very strong argument that seamlessness should not be our only goal in designing digital library systems. The talk is a year old but it is well worth reading today. I thank Donna Lanclos for tweeting about it recently.

I have argued strongly that we need to reduce the barriers to the use of e-resources for the academic community (I’ve shared some of my thinking in both an issue brief and a presentation to a major group of publishers and platform providers). Library, publisher, and vendor systems introduce far too many stumbling blocks for no good reason, impeding research and driving users away from scholarly sources. Discovery and access have improved tremendously over the past two decades, but they remain fundamentally broken today.

At the same time, there are places where efforts to create a simple user experience will lead to a misleading outcome or a methodological flaw. Sherratt’s piece offers a lovely illustration of this, showing the dramatic apparent spike in newspapers published in Victoria, Australia, during World War I, as recorded in Trove. As he points out, the spike is not real but simply an artifact of libraries that “have chosen to invest in the digitisation of newspapers from the World War I period.”

Or, to take another example, in computational analysis of large digital corpora, when is it possible to describe the dataset, both its contents and the methods of its creation, sufficiently to assess any skew in the findings? The “culturomics” effort, for instance, described its underlying Google Books source, but is it reasonable to draw conclusions about the English language globally based on 4% of the books published in that language, drawn from just 40 university libraries? Is there skew or is the corpus representative? How would we even know? Google’s ngram viewer provides a sophisticated set of search tools but no information on the “seams” of the dataset, such as how it was assembled and what bias may have been introduced as a result.

In reflecting on where seamlessness and simplicity should be foregrounded and where an acknowledgment of complexity and methodology should be encouraged, I would propose a set of principles for discussion.

Seamlessness is desirable in discovery and access for reading and research purposes. Of course, discovery services that are made available through a library should offer a transparent indication of their sources and contents, so that librarians can understand these tools and advise their user communities, a principle that NISO’s Open Discovery Initiative adopted as recommended practice 3.3.1. While researchers may not typically need access to this information themselves, it can be made available should the issue arise. And even in the absence of such full disclosure, a discovery corpus such as Google Books can provide a major advance in discovery for historians and other researchers. Artificial stumbling blocks should be systematically removed to meet researchers’ desire to discover and access content as readily as possible.

Increasingly we build and use corpora not only for discovery and access, but also for text mining, data mining, and other forms of computational analysis. This includes not only Google’s ngram viewer, but also the HathiTrust Research Center, JSTOR’s Data for Research, and Elsevier’s API, among others. When a digital corpus is made available for use as a primary source, it should provide as much information as possible to help researchers assess its suitability for the particular research project to be undertaken. This builds upon the need to describe what is missing from a collection before it is contributed to an archive (see 3.1.5) and is appropriate given that archival collections are almost by definition primary sources.

Seams like these take time and cost money for archivists and digital collection managers, but they are essential to the work of scholars. Seamlessness in discovery and access also takes time and adds costs for libraries, publishers, and vendors, but it saves the time and improves the experience of scholars. Resources are not unlimited and tradeoffs are frequently made. Still, both seams and seamlessness can be worthy and valuable.
