Linked Data: A Way Out of the Information Chaos and toward the Semantic Web
By Michael A. Keller
Discovery and access for students and faculty are conditioned by four factors, three of them dismal and one a mixed blessing. First, there are too many silos of information, each with its own organization, search/discovery mechanism, metadata “system” (aka “stovepipe”), and incomplete description of the content residing in it. Second, library/cybrary discovery mechanisms operate with too little precision (there is too much ambiguity) and inadequate recall (there is too little consistent, lasting, and culturally unencumbered linkage to related information objects). Third, particularly in the realm of library-generated or -curated metadata, our metadata is distant from the World Wide Web, meaning that it is not indexed and visible to all denizens of the web. The fourth factor—the mixed blessing—is, of course, Google and its competitors, all of which offer convenient search and other discovery methods over a seemingly broad universe of information assets, most of them web-based, obscuring the reality that many information resources are not indexed by them and are not accessible through them. In addition, web indexing and discovery environments such as Google do not easily offer the more subtle analytical methods that support the development of new hypotheses in all sectors and disciplines, new relationships among themes and cultures, and new encouragement to the fullest exploitation of the ideas that change our stewardship of Spaceship Earth.
A new approach to reducing the chaos of the stovepipes of metadata and the disparate gaggles of valuable information involves the use of simple, machine-readable statements of relationships among ideas, people, places, things, events, times, and information objects, whether physical (including the digital referents to them) or virtual. Those simple statements are made possible by the Resource Description Framework (RDF) and consist of three parts: a subject, a predicate, and an object or descriptor. “The sky is blue” is an example of a statement that can be expressed as an RDF triple, with “the sky” as the subject, “is” as the predicate, and “blue” as the object. The RDF triple links to information objects and even information services through Uniform Resource Identifiers (URIs).
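To make the shape of such a statement concrete, here is a minimal sketch in Python of “The sky is blue” as a triple, written out as one line of N-Triples (a plain-text RDF serialization). The `example.org` URIs, the `Triple` class, and the `to_ntriples` helper are assumptions made for illustration; they are not drawn from any particular library or project.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str    # URI of the thing being described
    predicate: str  # URI naming the relationship
    obj: str        # here a plain literal value; it could also be a URI


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose object is a literal as one N-Triples line."""
    # Escape backslashes and double quotes inside the literal.
    escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{escaped}" .'


# Hypothetical URIs under example.org, used only for illustration.
sky_is_blue = Triple(
    subject="http://example.org/thing/sky",
    predicate="http://example.org/property/color",
    obj="blue",
)

line = to_ntriples(sky_is_blue)
print(line)
```

Because the subject and predicate are URIs rather than free text, any other statement that uses the same URI is, by construction, a statement about the same thing.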
Various attempts to address the information chaos defined by these four factors have been only partially effective. Federated searching, for instance, has worked well only for limited subsets (say, up to 100 or 150) of academically relevant metadata and databases. The approach of aggregators, essentially drawing in and “normalizing” metadata, results in limited sets and possibilities, with access conditioned on commercial exchanges. That is too far from universal public access to meta-information and thus to discovery.
In 2001 and subsequently, Tim Berners-Lee and his colleagues at MIT and Southampton described and promoted a theory of a Semantic Web, of information objects that “understand” themselves in an environment predicated neither on documents nor on metadata but on relationships.1 RDF triples and information objects linked to those statements of relationships together form Linked Data environments, not quite a Semantic Web environment but a step toward one. Some of the tools and methods that librarians have devised support the development of Linked Data environments very well. Accurate and consistent bibliographic and other metadata records using authority files for names and topical descriptors can be transcoded to RDF triple statements, with links from them to URIs. This approach moves discovery away from metadata structures designed to manage physical assets and toward structures designed to expose relationships and provide links to documents. Linked Data could provide the antidote to the chaos and complexity of the current overabundant array of too-simple search mechanisms with too little precision and too little recall of relevant results.
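The transcoding step described above might look roughly like the following sketch. It assumes a flat catalog record held as a Python dictionary and a tiny stand-in authority file; the record layout, the `AUTHORITY` table, the `transcode` function, and every `example.org` URI are hypothetical. The property names come from the Dublin Core Terms vocabulary, which is one possible choice among several.

```python
# Hypothetical authority file: maps controlled name and topic strings to
# stable URIs. A real authority file would be vastly larger.
AUTHORITY = {
    "Berners-Lee, Tim": "http://example.org/person/berners-lee-tim",
    "Semantic Web": "http://example.org/topic/semantic-web",
}

# Dublin Core Terms namespace, used here for the predicates.
DCT = "http://purl.org/dc/terms/"


def transcode(record: dict) -> list[tuple[str, str, str]]:
    """Turn one flat bibliographic record into (subject, predicate, object) triples."""
    work = record["uri"]
    triples = [(work, DCT + "title", record["title"])]
    for name in record.get("creators", []):
        # Prefer the authority URI; fall back to the literal string if unknown.
        triples.append((work, DCT + "creator", AUTHORITY.get(name, name)))
    for topic in record.get("subjects", []):
        triples.append((work, DCT + "subject", AUTHORITY.get(topic, topic)))
    return triples


# Illustrative record; the work URI is an assumption.
record = {
    "uri": "http://example.org/work/1",
    "title": "The Semantic Web",
    "creators": ["Berners-Lee, Tim"],
    "subjects": ["Semantic Web"],
}

triples = transcode(record)
for t in triples:
    print(t)
```

The value of consistent authority control shows up in the lookup: a heading that resolves to a URI becomes a link that other statements can share, while an uncontrolled string remains an isolated literal.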
Semantic Web and Linked Data environments are in development in numerous institutions and collective projects. They build on the fundamental constructs of the Internet (programmers create algorithms supporting communication without concern for the underlying infrastructure of the network enabling the communication) and the World Wide Web (programmers and users work with sets of interconnected documents without concern for the computers storing and exchanging the documents).2 The Semantic Web allows programmers and users to refer to ideas, people, places, things, events, and real-world objects, including digital information objects, without concern for the underlying documents describing them—though linking to them, if desired. The web is using simple, pervasive, and massively enabling technologies such as HTTP, URLs, and HTML, and Linked Data will use the same or very similar technologies (HTTP, URIs in place of URLs, and RDF in place of HTML).
Those of us building or contributing to Linked Data environments and ultimately to the Semantic Web are identifying ideas, people, places, things, and events embedded in knowledge resources that our colleges and universities produce and consume. We are tying those facts together with named connections (the RDF triples). And we are, or should be, publishing those relationships, including the ties to the knowledge resources, as “crawlable” links on the web. Well-constructed ontologies are in use to help reduce ambiguity, to enable algorithmic coding of RDF triples, and to contribute to the quality (accuracy) of the expressions of the relationships in RDF triples.
The prescription for providing this new means for exploration and discovery, more effective ways of identifying and mapping relationships of facts to documents, and more efficient ways of sorting highly relevant materials from all others in the ocean of information resources—some superficial, some deep—involves
replacing stovepipe systems with a linked data grid of new metadata, probably by transcoding existing metadata to RDF triples;
using the grid of the RDF triples to focus precision and expand recall;
making the grid of the RDF triples crawlable, openly and freely available, and thus an integral component of the web; and
supporting user-generated RDF triples and annotations on ones provided by others.
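As a rough illustration of how a grid of triples can sharpen precision and widen recall at once, the sketch below pools statements from two hypothetical silos (a catalog and an archive) and queries them by pattern. The data, the `EX` prefix, and the `match` helper are all assumptions made for the example; a production system would more likely use a triple store and a query language such as SPARQL.

```python
EX = "http://example.org/"  # hypothetical namespace for illustration

# Two formerly separate silos, each already transcoded to triples.
catalog = [
    (EX + "work/1", EX + "creator", EX + "person/smith-john-1950"),
    (EX + "work/2", EX + "creator", EX + "person/smith-john-1823"),
]
archive = [
    (EX + "letter/9", EX + "creator", EX + "person/smith-john-1950"),
]

# The "grid" is simply the union of the statements from every silo.
grid = catalog + archive


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Everything created by one specific John Smith, wherever it is held.
hits = match(grid, p=EX + "creator", o=EX + "person/smith-john-1950")
found = [subject for subject, _, _ in hits]
print(found)
```

Querying by URI excludes the other John Smith (precision) and retrieves the archival letter alongside the cataloged work (recall), which a name-string search over a single silo could not do.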
Because Linked Data environments can (but need not) refer to physical entities, we in the library/cybrary world will be empowered to discard the old metadata structures and the costly and cumbersome methods of assembling them, releasing staff to improve the intellectual reach of our scholars and students, as well as providing new ways of integrating information across silos of content.
Semantic Web approaches in general and Linked Data methods specifically offer new opportunities for addressing the traditional and prevailing problems of too many silos of content, too many disparate modes of search and access, and too little precision and too much ambiguity in search results in the extreme environments of academic information resources intended to support and report on the research and teaching in large research enterprises. These opportunities build on the simple and powerful protocols driving the Internet and the web. Linked Data prototypes might also demonstrate new modes of discovery, navigation of complex information topographies, graphical user interfaces for exploration, and ways to customize discovery and access for users. We need to develop a large Linked Data prototype based primarily on metadata that we own or can reach and that will be transcoded to RDF triples and provided with URIs to associated documents, whether digital or physical, to test the efficacy and efficiency of these approaches.
1. See Tim Berners-Lee, James Hendler, and Ora Lassila, “The Semantic Web,” Scientific American, May 17, 2001, and James Hendler, Nigel Shadbolt, Wendy Hall, Tim Berners-Lee, and Daniel Weitzner, “Web Science: An Interdisciplinary Approach to Understanding the Web,” Communications of the ACM, vol. 51, no. 7 (July 2008), pp. 60–69.
2. See, for instance, Kulttuurisampo (http://www.kulttuurisampo.fi).