Total pages: two or more
Three readings are assigned. Please read each one and write a summary and analysis that answers the questions below. Answer each question separately, numbered 1), 2), 3), and separate the readings by title. Thank you so much!
- A brief summary of the key argument, problem, or issue
- Suggesting the significance of the piece (how it contributes to our understanding of this topic within our class’s broad study of human information interaction)
- Posing one or more questions that you would like to probe about this reading or any other combination of strategies to get the group discussion going
Introduction
Inherent in the concept of interactive information retrieval is the notion that we
interact with some search user interface (SUI) beyond the submission of an
initial query. Perhaps the most familiar SUI to many is the streamlined
experience provided by Google, but many more exist in online retail, digital
archives, within-website (vertical) search, legal records and elsewhere. Amazon,
for example, provides a multitude of different features that together make a
flexible, interactive and highly suitable gateway between users and products.
The aim of this chapter is to provide a framework for thinking about the
elements that make up different SUI designs, taking into account when and
where they are typically used.
Search: the way we usually see it
The SUI that many people now see daily is Google, and Figure 8.1 overleaf
shows the 14 notable SUI features it provides for users on its search engine
results page (SERP). The most common feature searchers expect to see is the
query box (#1 in Figure 8.1), which in Google provides a maintained context so
that the query can easily be edited or changed without going to the previous
page. Searchers are free to enter whatever they like, including special operators
that imply specific phrases or make sure certain words are not included. The
second most obvious feature is the display of results (#2), which is usually1
ordered by how relevant they are to the search terms. Results typically highlight
how they relate to the search terms by showing parts in bold font. Users are
typically able to view additional results using the pagination control (#3).
We also see many control and modifier SUI features. Google provides fixed
options across the top (#4) and relevant options down the left (#5) for
Interfaces for information retrieval
Max Wilson
Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.29085/9781856049740.010
Downloaded from https://www.cambridge.org/core. The University of British Columbia Library, on 04 Feb 2022 at 23:34:23, subject to the Cambridge
INTERACTIVE INFORMATION SEEKING, BEHAVIOUR AND RETRIEVAL
specializing the search towards certain types of results. Further, Google allows
users to restrict their results (#6), or change how they are shown (#7). It is typical
for search engines to provide an advanced search to help define searches more
specifically (#8). Finally, most search engines provide recommendations for
related queries (#9).
Google also provides extra information, such as an indicator on the number
of results found (#10), and information about when you may have made an
error (#11). Finally, Google also provides personalizable features that are
accessible when signed in (#14), such as settings (#13) and information about
your prior searches (#12).
A starting framework for thinking about SUI designs
Broadly, we can break the elements of a SUI, like those discussed in the Google
example above, into four main groups:
• input features – which allow the user to express what they are looking for
• control features – which help users to modify or restrict their input
• informational features – which provide results or information about results
• personalizable features – which relate specifically to searchers and their
previous interactions.
Figure 8.1 Fourteen notable features in the Google search user interface
These groups are highlighted in zones in Figure 8.2 (input as 1 and 8, control as
4, 5, 6, 7 and 9, informational as 2, 3, 10 and 11, personalizable as 12, 13 and 14),
and will be revisited throughout the chapter as other search interfaces provide
different features in these groups. Often new SUIs or SUI features innovate in
one of these groups. Finally, it is important to note that these groups can
overlap. Informational features are often modified by personalizable features,
for example, and some features can act as input, control and informational
features.
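As a minimal sketch of the framework above, the four groups can be written down as a small lookup table. The feature numbers and their grouping are taken from the text (Figures 8.1 and 8.2); the names `FEATURE_GROUPS`, `groups_for` and `uncategorized` are hypothetical and used only for illustration. Sets are used so that a feature could be listed in more than one group, since the groups can overlap.

```python
# Feature numbers follow the labels in Figure 8.1; the grouping is the one
# stated in the text for Figure 8.2. All names here are illustrative only.
FEATURE_GROUPS = {
    "input": {1, 8},
    "control": {4, 5, 6, 7, 9},
    "informational": {2, 3, 10, 11},
    "personalizable": {12, 13, 14},
}


def groups_for(feature_number):
    """Return the sorted names of every group that contains the feature."""
    return sorted(
        group for group, numbers in FEATURE_GROUPS.items()
        if feature_number in numbers
    )


def uncategorized(total_features=14):
    """Return the feature numbers that no group accounts for."""
    assigned = set().union(*FEATURE_GROUPS.values())
    return sorted(set(range(1, total_features + 1)) - assigned)


print(groups_for(1))    # the query box
print(uncategorized())  # an empty list means all 14 features are grouped
```

Running the check confirms that the zoning in the text accounts for all 14 numbered features.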
Early search user interfaces
A brief early history
The roots of information retrieval systems are in library and information science.
In libraries, books are indexed by a subject-oriented classification scheme and
to find books we interact with the physical spaces, signposting, and librarians
within them. Yet the study of information retrieval was motivated by the
development of computers in the 1960s, which could automatically perform one
of the tasks that librarians do: retrieve a document (or book). The interface with
computers, however, was with punch cards at first, and then command lines
sometime after. Immediately, we can see the kind of support we wanted
to provide to users (that of a librarian), but we were so far limited by technology.
Figure 8.2 The Google SUI zoned by the different types of feature categories
Conversation and dialogue
Given the user interface limitations, and the influence of librarianship, some of
the initial SUIs were modelled around conversations or ‘dialogues’. In
analysing, for example, the roles, questions and answers that took place in
conversations between visitors and librarians (Winograd and Flores, 1986), early
researchers developed question and answer style SUIs. Figure 8.3 shows an
early command-line dialogue-style system introduced in the 1970s (Slonim,
Maryanski and Fisher, 1978), which tried to help users describe what they were
searching for. These SUIs typically asked the searchers for any information they
already had about what they wanted, so that when it came to performing the
search (which could last a number of minutes or hours even) it was more likely
to return the correct result.
This conversational style was analysed for some time, and was also influenced
by those interested in artificial intelligence and natural language processing. As
technology improved and results were returned faster, the emphasis of the
conversational perspective moved towards modelling a continued dialogue
over multiple searches within interactive information retrieval. The MERIT
system (Belkin et al., 1995), for example, was designed based on a much more
flexible, continuing, conversation model.
Browsing
Another early type of system, still using command-line interaction, supported
‘browsing’. Similar to the initial dialogue-based systems, browsing systems
like the 1979 BROWSE-NET (Palay and Fox, 1980) in Figure 8.4 (on page 144)
presented different modes to scan through databases and provided options for
different ways of accessing the documents. Again, we see these browsing style
systems appear over the course of interactive information retrieval design,
although in 1983 research identified that people ‘browsed’ less on the early
online newsgroups. Geller and Lesk (1983) hypothesized that this may have
been because people often knew more about what was in a fixed dataset than
in the oft-changing web collection we have now. Despite this hypothesis, we
later saw the rise of website directories, like the Yahoo! Directory. Directories,
while still available, were never as successful as web search engines, perhaps
providing evidence for Geller and Lesk’s hypothesis. More recently, we see
browsing interfaces appear within individual websites, as discussed further in
the discussion of faceted browsing in the section ‘Faceted metadata’.
Form filling
As SUIs became more directly interactive, with the onset of commercially
available graphical user interfaces2 in the early 1980s, the common paradigm we
see today of ‘form filling’ became more popular. This advanced the conversational
response SUIs, which took input over time from a series of questions, by providing
Figure 8.3 An early command-line dialogue-style system (Slonim, Maryanski and Fisher,
1978).
Copyright © 1978 ACM, Inc. doi>10.1145/800096.803134. Reprinted by
permission.
all the data entry fields spatially. Although ‘form filling’ includes normal
keyword searching, this technique allowed systems to present all the fields
that could be individually searched in a way that we now commonly call an
advanced search. The EUROMATH system, shown in Figure 8.5 and designed by
McAlpine and Ingwersen (1989), has a custom form highlighting all the fields
that can be searched individually or in combination.
Boolean searching
One advance in the algorithmic technologies was to process Boolean queries, so
that we could ask for information about ‘Kings OR Queens’, and get a more
comprehensive set about, in this case, monarchs. This technological advance
was made before the majority of SUI developments, as can be seen in Figure
8.5. The advent of GUIs, however, provided an opportunity to help people
construct Boolean queries more easily and visually. The STARS system (Anick
et al., 1990), shown in Figure 8.6, allowed users to organize their query in
a 2D space, where horizontal placement represented ‘AND’ joins and vertical
alignment represented ‘OR’ joins. Like all these early ideas, Boolean searching is still
prevalent in our modern interactive information retrieval SUIs, including Google
(see Figure 8.1); in Google, a ‘-’ before a word is equivalent to a Boolean NOT.
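To make the Boolean operators concrete, here is a minimal sketch that evaluates a query laid out in the STARS manner: terms in the same column are OR-ed, columns are AND-ed, and excluded terms act as NOT. This is not the STARS implementation; the toy documents, the inverted index and the function names (`build_index`, `search`) are assumptions made for illustration.

```python
# Toy collection: the documents and their ids are invented for illustration.
DOCS = {
    1: "kings of england",
    2: "queens of england",
    3: "kings and queens of france",
    4: "presidents of france",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, columns, excluded=()):
    """Evaluate a grid-style Boolean query.

    Terms inside one column are OR-ed together; the columns are then
    AND-ed. Terms in `excluded` are removed afterwards (Boolean NOT).
    """
    result = None
    for column in columns:
        matches = set()
        for term in column:
            matches |= index.get(term, set())
        result = matches if result is None else result & matches
    result = result or set()
    for term in excluded:
        result -= index.get(term, set())
    return sorted(result)


INDEX = build_index(DOCS)
print(search(INDEX, [["kings", "queens"]]))                       # Kings OR Queens
print(search(INDEX, [["kings", "queens"], ["france"]]))           # ... AND France
print(search(INDEX, [["kings", "queens"]], excluded=["france"]))  # ... NOT France
```

The OR query returns the broadest set, which matches the point in the text that ‘Kings OR Queens’ gives a more comprehensive result than either term alone.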
Summary
The initial advances in information retrieval were typically made in
technological improvements. Consequently, these SUI advances in the early
days related mainly to the input SUI features, with the exception of some
advances (like the browsing and form filling) which provided information about
the structure of the data, making them also contribute to the informational SUI
Figure 8.4 An early browsing interface
for databases (Palay and Fox, 1981).
Copyright © 1980 ACM, Inc. Reprinted by
permission.
Figure 8.5 The EUROMATH interface
(McAlpine and Ingwersen, 1989).
Copyright © 1989 ACM, Inc.
doi>10.1145/75334.75341. Reprinted by
permission.
features. Other informational advances included simple highlighting in a result
where it matched the query, as shown in Figure 8.7 where the horizontal bar at
the bottom indicated where in a book any search terms appear. The onset of
GUIs meant that SUIs became more interactive, with Pejtersen’s fiction browser
(Pejtersen, 1989) presenting an explorable-world view of a bookshop, as shown
in Figure 8.8 overleaf. Pejtersen’s fiction bookshop allowed users to browse the
bookshop using different strategies, where the figures shown are engaging in each
strategy. We were not yet, however, engaging in what we now call interactive
Figure 8.6 The STARS system (Anick et al.,1990).
Copyright © 1990 ACM, Inc. doi>10.1145/96749.98015.
Reprinted by permission.
Figure 8.7 Use of highlighting for terms that match a query (Teskey, 1988).
Copyright © 1988 ACM, Inc. doi>10.1145/62437.62481.
Reprinted by permission.
information retrieval, where we consider interactive information retrieval to be
the ongoing interaction over multiple searches to reach a goal, rather than the
single search that is still often considered in information retrieval.
The onset of modern interactive information retrieval SUIs
Modern interactive information retrieval SUIs began to appear around the
time that we first saw web search engines like AltaVista,3 but before Google
was launched. One of the first studies to demonstrate that there were
significant and specific benefits to interactive information retrieval, where users
actively engage in refining and submitting subsequent queries, was provided
by Koenemann and Belkin (1996). Using a query engine that was popular at
the time called INQUERY, Koenemann and Belkin built the RU-INQUERY SUI,
shown in Figure 8.9 (b). Searchers could submit a query in the search box at
the top left, and see a scrollable list of results on the right hand side. The current
query was then displayed in the box underneath the search box. The full text
of any selected result was displayed beneath the results on the right. The RU-
INQUERY interface had hidden, visible, and interactive relevance feedback
terms; the interactive terms provided the most effective support for users.
Figure 8.8 Pejtersen’s fiction bookshop (Pejtersen, 1989).
Copyright © 1989 ACM, Inc. doi>10.1145/75334.75340.
Reprinted by permission.
The experiment was built to leverage relevance feedback (discussed in Chapter
6, ‘Access Models’), which used key terms from the results marked as ‘relevant’,
using the check boxes, and added them to the search to get more precise results.
To demonstrate the benefits of interaction in information retrieval, three
alternative versions were developed:
• opaque – provided the typical relevance feedback experience that was
common at the time, where terms from the selected relevant documents
were added, but there was nothing in the SUI to display what those
additional terms were (Figure 8.9(b))
• transparent – provided a similar experience to the opaque version, except
that the added terms were made visible in the ‘current query’ box
• penetrable – allowed the users to choose additional terms from the relevant
documents; the keywords associated with the relevant documents were
listed in a separate box below the ‘current query’ box (Figure 8.9 (a)), and
could be added to the current query box manually.
While all three experimental versions provided improved support within a task-
based user study, the most interactive penetrable version provided statistically
significant improvements and did not significantly increase the time involved
in searching. When analysed according to the framework described in the
section ‘A starting framework for thinking about SUI designs’ above, this study
showed the initial value of having control SUI features that help people modify
and manipulate a search.
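The difference between the three conditions can be sketched in a few lines. This is only an illustration under simplifying assumptions: feedback terms are chosen by raw frequency in the documents marked relevant, with no stop-word handling, which is not how INQUERY or RU-INQUERY selected terms. The function names and the returned dictionary keys are hypothetical.

```python
from collections import Counter


def feedback_terms(relevant_docs, query_terms, k=3):
    """Pick the k most frequent words in the relevant documents that are
    not already in the query (ties broken alphabetically)."""
    counts = Counter(
        word
        for doc in relevant_docs
        for word in doc.lower().split()
        if word not in query_terms
    )
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:k]


def expand(query_terms, relevant_docs, mode, chosen=(), k=3):
    """Return what is searched and what the searcher sees in each condition."""
    suggestions = feedback_terms(relevant_docs, query_terms, k)
    if mode == "opaque":
        # Terms are added silently; the query box still shows the original.
        return {"searched": query_terms + suggestions,
                "current_query_box": query_terms,
                "suggestion_box": []}
    if mode == "transparent":
        # Same expansion, but the added terms are visible.
        expanded = query_terms + suggestions
        return {"searched": expanded,
                "current_query_box": expanded,
                "suggestion_box": []}
    if mode == "penetrable":
        # Suggestions are listed separately; only chosen ones are added.
        picked = [term for term in suggestions if term in chosen]
        expanded = query_terms + picked
        return {"searched": expanded,
                "current_query_box": expanded,
                "suggestion_box": suggestions}
    raise ValueError(f"unknown mode: {mode}")


docs = ["jaguar habitat rainforest", "jaguar rainforest diet"]
for mode in ("opaque", "transparent", "penetrable"):
    print(mode, expand(["jaguar"], docs, mode, chosen=["diet"], k=2))
```

Only the penetrable condition leaves the choice of terms with the searcher, which is the control the study found most effective.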
Figure 8.9 The RU-INQUERY interface (Koenemann and Belkin, 1996).
Copyright © 1996 ACM, Inc. doi>10.1145/238386.238487.
Reprinted by permission.
(a) Penetrable condition
(b) Opaque condition
Modern search user interfaces and features
This section covers many of the more modern advances in SUI designs, and is
structured according to the framework described in the section ‘A starting
framework for thinking about SUI designs’. It begins by discussing input
features, before moving on to control, informational and personalizable features.
Input features
While there have been many technical advances in the processing of user
queries and matching them against documents, the plain white search box has
remained pleasingly simple. This section begins by examining the design of the
search box, before moving on to other input methods.
The search box
The search box pervades SUIs and searchers can feel at a loss when they do not
have a small white text field to spill their search terms into. The search box has
many advantages:
• Flexibility – It is extremely flexible (assuming the technology behind it is
well made), uses the searcher’s language and the searcher can be as
generic or specific as they like.
• An informational feature – As well as being primarily used as an input
feature, the search box can – and should – be used as an informational
feature. When not being used to enter keywords, the search box should be
informing the user of what is currently being searched for.
• The auto-complete function – This can help people avoid entering unproductive
search queries. By providing information to the user as they query, auto-
complete helps make the search box a better informational feature as well as
an input feature. Auto-complete can be rich with context, with the Apple
website providing images, short descriptions and even prices, as can be seen
in Figure 8.10 (a). Furthermore, auto-complete can be personalizable, as with
Google in Figure 8.10(b), which shows queries the searcher has used before.
• Operators and advanced search – The keyword search box itself has only
really had minor visual changes, with some suggesting this may affect the
number of words people put in their query. Regardless, studies indicate
that searchers submit between two- and three-word queries (Jansen,
Spink and Saracevic, 2000; Kamvar et al., 2009), and around 10% of
searchers use special operators to block certain words or match explicit
phrases. Advanced search boxes, when implemented well, can help guide
people towards providing more explicit queries in the search box.
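The auto-complete behaviour described in the list above, including the personalizable variant that shows the searcher's earlier queries first, can be sketched as a simple prefix match. This is an assumption-laden illustration rather than how Google or Apple implement it; `autocomplete`, `history` and `popular` are hypothetical names, and the sample queries are invented.

```python
def autocomplete(prefix, history, popular, limit=5):
    """Suggest completions for `prefix`.

    The searcher's own past queries (`history`) are listed before
    generally popular queries (`popular`); duplicates are dropped.
    """
    prefix = prefix.lower().strip()
    if not prefix:
        return []
    suggestions = []
    for source in (history, popular):
        for query in source:
            candidate = query.lower()
            if candidate.startswith(prefix) and candidate not in suggestions:
                suggestions.append(candidate)
    return suggestions[:limit]


history = ["information retrieval"]
popular = ["infographic", "information retrieval", "influenza"]
print(autocomplete("inf", history, popular))
```

Because suggestions appear while the searcher is still typing, the same box acts as an input feature and an informational feature at once.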
(a) Apple – shows lots of contextual data (b) Google – prioritizing previous searches
Query by example
There is a range of searching systems that take example results as the input.
One example commonly seen in SERPs is a ‘More Like This’ button, which
returns pages that are related to a specific page. While these could be seen as
control examples, an example demonstrator called Retrievr 4 lets searchers
sketch a picture and returns similar pictures. Similarly, services like Shazam5
use recorded audio as a query to find music. Shazam and Retrievr are examples
that are explicitly query by example input features, while others can be seen as
input and/or control.
Adding metadata
While there have been some variations in how we enter information into a
search box, the alternative is typically to present useful and usable metadata to
the users as an input feature. The presentation and use of metadata in SUIs,
however, can be very hard to delineate in its contribution between input,
control, informational and personalizable features. Indeed, well designed use
of metadata can serve as a feature in each of these feature types. Presented on
the front page of a SUI, categories can, for example, allow the searcher to input
their query by browsing. If a searcher can filter their keyword search, or make
sub-category choices, then metadata can quickly become a control feature.
Further, if results are accompanied by how they are categorized, then metadata
Figure 8.10 Examples of auto-complete
can become an informational feature too; research has shown this to be popular
with searchers (Drori and Alon, 2003). Finally, it’s not beyond the realm of
possibility to highlight popular or previously used category options to make
them personalizable too.
Categories
Websites, including the Yahoo! Directory, often present high-level categories to
help users externalize what they are looking for. Several studies (Egan et al., 1989;
Dumais, Cutrell and Chen, 2001) have shown that categorizing results in SUIs
can help users to find results more quickly and more accurately. One key early
system called SuperBook, which automatically created a categorized index over
full-text documents, was shown to help people learn, as measured by quality of
short open-book essays (Egan et al., 1989). More recently, eBay and Amazon
provide searchers with higher level categories so that they can first define what
type of object they are looking for before browsing with richer metadata.
Clusters
One challenge for categories, especially for the whole web, is to categorize all
the data. Another approach, using clustering algorithms in the backend, is to
cluster results by key topics in their content. One early cluster system, called
Scatter/Gather, divided results into clusters of similar topics to highlight the
range of topics covered in a SERP. Evaluation of the Scatter/Gather approach
showed that searchers were easily and quickly able to identify groups of more
relevant documents compared with a standard SERP (Hearst and Pedersen,
1996).
A more recent system, Clusty 6 (Figure 8.11), embodies a clustering method
that creates automatic hierarchical clusters based on the results that are
returned, but is primarily used as a control feature. Despite some studies
showing evidence that clusters help searchers to search (e.g. Turetken and
Sharda, 2005), research has suggested that well designed, carefully planned
metadata is better for SUIs than automatically generated annotations (Hearst,
2006a).
Faceted metadata
It has been popular to categorize results in multiple different ways, so that
searchers can express several constraints. Research has shown that, compared
with keyword search, faceted systems can improve search experiences in more
open-ended or subjective tasks (where no single right answer is available)
(Stoica and Hearst, 2004). The popular Epicurious website, for example, allows
users to describe recipes that they would like by several types of categories
(called facets), including cuisine, course, ingredient and preparation method.
While the first selection in a facet acts as an input, subsequent selections in facets
act as refinements, and can thus be considered as control.
The Flamenco interface 7 (Yee et al., 2003), shown in Figure 8.12, provides
Figure 8.11 The Clusty system
Figure 8.12 The Flamenco interface
several different categories (called facets), which can be used in combination to
define a query. It was used to demonstrate the value of faceted browsing and
represents the standard faceted SUI design. Many variations have been
designed since. Typically, a range of hierarchical or linear facets are provided,
and users can make selections in one or more of them. In Flamenco, used facets
are removed from view, so that remaining facets can receive more attention,
and selections are placed in a breadcrumb list of choices. Removing used facets
provides an effective approach for quickly narrowing results. Other systems
like mSpace8 (schraefel et al., 2006) leave facets in place to encourage exploration
by quickly changing and comparing decisions. mSpace provides an advanced
faceted SUI where the order of facets implies importance and gaps from left to
right are highlighted. Figure 8.13 shows that the two clips in the far right column
are from 1975 and 1974, which would not normally be conveyed in faceted SUIs.
mSpace (and iTunes) facets are only filtered in a left to right direction, and
highlights have been shown to help searchers learn and discover related items
in the remaining unused facets (Wilson, André and schraefel, 2010). Other
systems, including mSpace and eBay, permit multiple selections within single
facets, so, for example, searchers can see results that relate to two price brackets.
Faceted categories are typically used within fixed collections of results, such as
within one website (typically called vertical search), as there must be common
attributes across all the data to categorize them effectively. Although researchers
have tried to apply facets to general web search (Kules, Kustanowitz and
Shneiderman, 2006), Google does not typically provide faceted search, except in
Google Shopping.9 In the narrower space of searching for products, there are
common factors like price and shop that can be easily applied to all of the results.
Figure 8.13 The mSpace interface
More detailed literature can be read about the design of faceted metadata
interactive information retrieval systems (Hearst, 2006b; Tunkelang, 2009).
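A minimal sketch of faceted filtering, assuming a tiny invented recipe collection with two of the facets mentioned for Epicurious (cuisine and course). Selections are combined with AND across facets and OR within a single facet, which corresponds to the multiple-selection behaviour described for mSpace and eBay. The data and the names `facet_filter` and `facet_counts` are hypothetical.

```python
# Invented sample data; only the facet names come from the text.
RECIPES = [
    {"name": "Pad Thai", "cuisine": "Thai", "course": "Main"},
    {"name": "Tom Yum", "cuisine": "Thai", "course": "Soup"},
    {"name": "Minestrone", "cuisine": "Italian", "course": "Soup"},
    {"name": "Lasagne", "cuisine": "Italian", "course": "Main"},
]


def facet_filter(items, selections):
    """Keep items that match every facet with a selection.

    `selections` maps a facet name to a set of accepted values:
    OR within one facet, AND across facets. Empty sets are ignored.
    """
    return [
        item for item in items
        if all(
            item.get(facet) in values
            for facet, values in selections.items()
            if values
        )
    ]


def facet_counts(items, facet):
    """Count the items per value of `facet`, as shown next to facet options."""
    counts = {}
    for item in items:
        counts[item[facet]] = counts.get(item[facet], 0) + 1
    return counts


thai = facet_filter(RECIPES, {"cuisine": {"Thai"}})
print([r["name"] for r in thai])
print(facet_counts(thai, "course"))
print([r["name"] for r in facet_filter(
    RECIPES, {"cuisine": {"Thai", "Italian"}, "course": {"Main"}})])
```

The sketch also shows why facets suit fixed collections: every item must carry a value for each facet before it can be filtered or counted.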
Social metadata
The rise of social media websites has led to the popular use of socially generated
metadata, such as tags. Tag clouds are now familiar in many SUIs, with services
like Flickr allowing you to explore by popular tags.10 Further, the research
prototype called Mr Taggy11 (Figure 8.14) allows users to search the web using
different types of tags, separating out adjectives and nouns, collected from
Delicious. Studies of MrTaggy suggest that searchers can explore and learn
more when tag clouds are available (Kammerer et al., 2009). Research has also
shown that building tags into the display of SERP can help users to (control or)
change their search at later stages (Gwizdka, 2010). Other social metadata can
be used, but is often informational, such as what other searchers have found or
how results have been rated.
Control features
Control features can be considered particularly …
1 Introduction to Library Databases
As they said in the Sound of Music, “let’s start at the very beginning.” In
this case, a definition and some history, leading up to the current state of
the database industry and its major players. Last, we’ll go over the recent
development known as “Discovery Services” and explain why those systems
are outside the scope of this book.
Electronic access to information by means of the Web is so pervasive
that we take it for granted. You have undoubtedly already used library
databases somewhere in your academic life, and either heard or tossed the word
“database” around yourself. But where did these “databases” come from?
Why are they important? What is a database, anyway? Let’s address that
last question first, and then find out where they came from.
What Is a Database?
The Oxford English Dictionary defines “database” as: “A structured set
of data held in computer storage and typically accessed or manipulated by
means of specialized software.” (So much for not using any part of a word
in its definition.) For “data,” let us substitute “information.” A database is
a way to structure, store, and rapidly access huge amounts of information
electronically. That “information” can be numerical or textual, even visual.
And as the Encyclopedia of Computer Science (2003) notes: “An important
feature of a good database is that unnecessary redundancy of stored data
is avoided.” The key concepts are structure (an organized way to store the
information, accomplished by tables, records, and fields, which are discussed
later in this chapter), efficiency (no redundancy), and rapid access (the
ability to search and retrieve material from the database as quickly as possible).
Copyright 2015. Libraries Unlimited. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
EBSCO Publishing : eBook Collection (EBSCOhost) – printed on 2/4/2022 6:34 PM via UNIVERSITY OF BRITISH COLUMBIA
AN: 1197818 ; Bell, Suzanne S..; Librarian’s Guide to Online Searching: Cultivating Database Skills for Research and Instruction, 4th Edition : Cultivating
Database Skills for Research and Instruction
Account: s5672194.main.ehost
As my husband the computer scientist puts it, “a database isn’t magic but it
is pretty smart.”
Is Google a Database?
Absolutely, in the sense that Google is a vast collection of data (the
contents of web pages and material linked to web pages) that is searchable and
provides rapid access to results. But Google and similar web search engines
are not the focus of this textbook. These search tools build their databases
automatically, from material that is freely accessible on the web, and their
structure, scope, size, and many other aspects are not obvious. As far as one
can tell, there is no quality control, no human intervention involved in
building the database.1
The commercial and governmental databases considered in this text are
products specifically crafted to achieve the goal of providing users access to
formally published information (e.g. articles, conference papers, books,
dissertations, reports), in a very organized and efficient fashion. (In the case of
commercial databases, part of that crafting is a mechanism for limiting access
to paid subscribers.) Let us refer to these as “library databases,” since that is
where you usually encounter them. Library databases tend to be targeted to
specific audiences, and to offer customized features accordingly. Their
structure, scope, size, date coverage, publication list and many other details are
either obvious from their search interfaces, or explicitly provided. (I would
like to say that library databases are more structured than Google, but I can’t.
Because who knows how Google is structured or how its search algorithm re-
ally works? The “black boxness” of Google is another thing that distinguishes
it from the databases that are the focus of this text.) Library databases are
much less well known and ubiquitous than Google, and usually not free, but
there are good reasons for that. Finding out where these databases came from
should help explain why (as the adage goes, “you get what you pay for”).
Historical Background
Indexing and Abstracting Services
In the Beginning . . .
There was hard copy. Writers wrote, and their works were published
in (physical) magazines, journals, newspapers, or conference proceedings.
Months or years afterward, other writers, researchers, and other alert read-
ers wanted to know what was written on a topic. Wouldn’t it be useful if
there were a way to find everything that had been published on a topic, with-
out having to page through every likely journal, newspaper, and so forth?
It certainly would, as various publishing interests demonstrated: as early
as 1848, the Poole’s Index to Periodical Literature provided “An alphabeti-
cal index to subjects, treated in the reviews, and other periodicals, to which
no indexes have been published; prepared for the library of the Brothers in
Unity, Yale college” (Figure 1.1).
Not to be caught napping, the New York Times started publishing their
Index in 1851, and in 1896, taking a page from Poole’s, the Cumulative In-
dex to a Selected List of Periodicals appeared, which soon (1904) became
the canonical Readers’ Guide to Periodical Literature. Thus, in the mid-19th
century, the hard copy Index is born: an alphabetical list of words, represent-
ing subjects, and under each word a list of articles deemed to be about that
subject. The index is typeset, printed, bound, and sold, and all of this effort
is done, slowly and laboriously, by humans.
Given the amount of work involved, and the costs of paper, printing, etc.,
how many subjects do you think an article would have been listed under?
Every time the article entry is repeated under another subject, it costs the
publisher just a little more. Suppose there was an article about a polar expedi-
tion, which described the role the sled dogs played, the help provided by the
native Inuit, the incompetence on the part of the provisions master, and the
fund-raising efforts carried on by the leader’s wife back home in England. The
publisher really can’t afford to list this article in more than one or two places.
Under which subject(s) will people interested in this topic be most likely to
look? The indexer’s career was a continuous series of such difficult choices.
Figure 1.1. Index listing and title page (inset) from Poole’s Index to Periodical Literature. Courtesy of the Department of Rare Books and Special Collections, University of Rochester Libraries, Rochester, NY.
An index, recording that an article exists and where it would be found,
was a good start, but one could go a step further. The addition of a couple of
sentences (e.g., to give the user an idea of what the article is about) increases
the usefulness of the finding tool enormously—although the added informa-
tion, of course, costs more in terms of space, paper, effort, etc. But some index
publishers started adding abstracts, gambling that their customers would
pay the higher price (which they did). Thus, we have the advent of Abstract-
ing and Indexing services or “A & I,” terminology that you may still see in
the library literature.
The abstracts were all laboriously written by humans. They needed to
be skilled, literate humans, and skilled humans are very expensive (even
when they’re underpaid, they are expensive in commercial terms). Humans
are also slow, compared with technology. Paper and publishing are expensive,
too. Given all this, how many times do you think an entry for an article would
be duplicated (appear under multiple subjects) in this situation? The an-
swers are obvious; the point is that the electronic situation we have today is
all grounded in a physical reality. Once it was nothing but people and paper.
From Printed Volumes to Databases
Enter the Computer
The very first machines that can really be called digital computers were
built in the period from 1939 to 1944, culminating in the construction of the
ENIAC in 1946, “the first general-purpose, electronic computer” (Encyclopæ-
dia Britannica Online 2014). These machines were all part of a long pro-
gression of innovations to speed up the task of mathematical calculations.
After that, inventions and improvements came thick and fast: the 1950s and
1960s were an incredibly innovative time in computing, although probably
not in a way that the ordinary person would have noticed. The first ma-
chine to be able to store a database, RCA’s Bizmac, was developed in 1952
(Lexikon’s History of Computing 2002). The first instance of an online trans-
action-processing system, using telephone lines to connect remote users to
central mainframe computers, was the airline reservation system known
as SABRE, set up by IBM for American Airlines in 1964 (Computer His-
tory Museum 2004). Meanwhile, at Lockheed Missile and Space Company, a
man named Roger Summit was engaged in projects involving search and re-
trieval, and management of massive data files. His group’s first interactive
search-and-retrieval service was demonstrated to the company in 1965; by
1972, it had developed into a new, commercially viable product: Dialog—the
“first publicly available online research service” (Dialog 2005).
Thus, in the 1960s and 1970s, when articles were still being produced
on typewriters, indexes and abstracts were being produced in hard copy, and
very disparate industries were developing information technologies for their
own specialized purposes, Summit can be credited with having incredible
vision. He asked the right questions:
1. What do people want? Information.
2. Who produces information, and in what form? The government
and commercial publishers, in the form of papers, articles, news-
papers, etc.
3. What if you could put information about all that published mate-
rial into a machine-readable file: a database—something you could
search?
Summit also had the vision to see how the technological elements could
be used. The database needed to be made only once, at his firm’s headquar-
ters, and trained agents (librarians) could then access it over telephone lines
with just some simple, basic equipment. The firm could track usage exactly
and charge accordingly. Think of the advantages!
The advantages of an electronic version of an indexing/abstracting sys-
tem are really revolutionary. In a system no longer bound by the confines of
paper, space, and quite so many expensive skilled personnel:
• Articles could be associated with a greater number of terms describing
their content, not just one or two (some skilled labor is still required).
• Although material has to be rekeyed (i.e., typed into the database),
this doesn’t require subject specialists, simply typists (cheap labor).
• Turnaround time is faster: most of your labor force isn’t thinking
and composing, just typing continuously—the process of adding to
the information in the database goes on all the time, making the
online product much more current.
• If you choose to provide your index “online only,” thus avoiding the
time delays and costs of physical publishing, why, you might be able
to redirect the funds to expanding your business: offering other in-
dexes (databases) in new subject areas.
As time goes on, this process of “from article to index” gets even faster.
When articles are created electronically (e.g., word processing), no rekeying
is needed to get the information into your database, just software to convert
and rearrange the material to fit your database fields. So, rather than typ-
ists, you must pay programmers to write the software, and you still need
some humans to analyze the content and assign the subject terms.
In the end, the electronic database is not necessarily cheaper to create;
it very likely costs more! The costs have simply shifted. But customers buy
it because . . . it is so much more powerful and efficient. It is irresistible, and
printed indexes have vanished like the dodo. Online library databases are
an integral part of the research process.
The Library Database Industry Today
For a line of business and a product you probably weren’t very aware
of until you were in high school or college, the library database business is,
for the moment, surprisingly robust. The juggernaut of Google and espe-
cially Google Scholar has not put the commercial database vendors out of
business (yet—I’m sure there is a constant undercurrent of fear throughout
the business). Probably the largest commercial vendors, and ones you might
have heard of before reading this book, are EBSCO, ProQuest, and Gale
(Gale Cengage). Other major names to add to your repertoire are Thomson-
Reuters (creators of the Web of Science and many other databases), JSTOR,
LexisNexis, OCLC FirstSearch, ABC-CLIO, Alexander Street Press, Project
MUSE, and OVID. Most of the databases produced by these vendors have
content drawn from many sources, many publishers: they aggregate content,
bringing it together so you can search across all of it in one database. Hence
the term aggregators, which is often used to describe the multidisciplinary
article databases from the vendors listed above. In contrast, major
publishers such as Elsevier, Oxford University Press, and Sage Publications
are big enough to create databases just of the materials they publish, for ex-
ample: Elsevier’s ScienceDirect database, Oxford Music Online, Sage Jour-
nals and the CQ databases (an imprint of Sage).
In addition to the commercial entities mentioned above, some profes-
sional associations create and manage the subscriptions to databases of
their materials. Examples include the Association for Computing Machin-
ery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), the
American Society of Mechanical Engineers (ASME), the American Math-
ematical Society (AMS), and the American Chemical Society (ACS). Govern-
ment and international agencies also produce databases. US government
agencies such as the National Library of Medicine, the Department of Edu-
cation, the Census Bureau, and the Bureau of Labor Statistics are the au-
thors of key databases in their respective topical areas, which we will cover
in subsequent chapters. At the international level, the World Bank,2 the In-
ternational Monetary Fund, and the Organization for Economic Cooperation
and Development (OECD) all offer databases of their information.
The names of library database vendors listed above represent only the
largest and/or better-known entities. As in any line of business there are,
of course, many more companies, either smaller or focused on a particular
audience (the number of vendors that create databases specifically for the
business community, both corporate and academic, is remarkably exten-
sive). The database vendor industry is also a business like any other: it
is subject to consolidation and occasionally to expansion. Companies come
and go through mergers and acquisitions, start-ups, and occasional deaths.
Changes may not happen as rapidly as in some industries, but when they
do, they can be significant. Three of the notable changes in the current
decade were EBSCO’s acquisition of the H.W. Wilson databases, and two
major moves by ProQuest: the acquisition of the CSA databases and taking
over publication of the Statistical Abstract of the United States from the US
Census Bureau, including putting all the Statistical Abstract content into
a new database.
At the beginning of this section, I made a reference to the database
vendors’ (not to mention librarians’) fears about Google and Google Scholar:
that these free, ubiquitous, embedded-in-daily-life resources might spell
the end of the library database business. The vendors have been fighting
back for many years, however, first with something called “federated search”
(about which the less said the better; the title of Jody Fagan’s 2011 editorial
on the topic says it all: “Federated Search Is Dead—and Good Riddance!”).
The latest counter-attack by the database vendors, dubbed “Discovery Ser-
vices” or “Web Scale Discovery Services,” is far superior. Reports on usage
statistics from institutions that have adopted a discovery service indicate
that these products may have a strong chance of winning ground back from
the almighty Googleplex (Way 2010, Kemp 2012, Daniels, Robinson, and
Wishnetsky 2013, Calvert 2014).
The following section will provide a brief overview of discovery services,
concluding with why they will not be considered further in this text.
Discovery Services
Discovery Services are systems that harvest and pre-index a wide vari-
ety of library content from separate sources (records from library databases,
the online catalog, perhaps the local institutional repository or other locally
developed databases), build one giant index of all that content, and provide
near-instant, relevancy ranked results through one search box (Vaughan
2011, Adams et al. 2013). Sound familiar? It is exactly the Google model, but
instead of web pages it draws on all the vetted and expensive resources for
which the library has already paid, making them “discoverable.” These sys-
tems are frequently referred to as Web-Scale Discovery Services, “meaning
they search library collections the way Google searches the web: by search-
ing the entire breadth of content available in the library’s collection” (Fry
2013). (The “entire breadth of content” is at least the goal if not the reality
right now.) The essential key is the pre-indexing, getting the data from all
the disparate resources ahead of time, as it were, to build that one giant
index that can provide the speedy response time that users expect. Where
the discovery systems start to part ways with Google is on the results page,
which is loaded with options for refining and outputting results, and where
library-owned full text is instantly accessible.
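The pre-indexing idea can be illustrated with a toy sketch. The Python below is only an illustration under stated assumptions, not a description of how any vendor's product works: the source names, the sample records, and the functions build_unified_index and search are all hypothetical, and relevancy ranking is left out entirely.

```python
from collections import defaultdict

# Hypothetical sample content from three separate library sources.
SOURCES = {
    "catalog": {"c1": "History of polar expeditions"},
    "article_db": {
        "a1": "Sled dogs on polar expeditions",
        "a2": "Galaxy formation and the Milky Way",
    },
    "repository": {"r1": "Thesis on Inuit guides and polar travel"},
}


def build_unified_index(sources):
    """Harvest every record ahead of time into one word -> locations index."""
    index = defaultdict(set)
    for source_name, records in sources.items():
        for record_id, text in records.items():
            for word in text.lower().split():
                index[word].add((source_name, record_id))
    return index


def search(index, word):
    """Answer a one-box query from the prebuilt index, not the live sources."""
    return sorted(index.get(word.lower(), set()))


unified = build_unified_index(SOURCES)
print(search(unified, "polar"))
```

The point of the sketch is that all the harvesting happens once, in build_unified_index, so a query only has to consult the single combined index.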
The vendors and products in the discovery service market at the time of
this writing are EBSCO Discovery Service (EDS), Serials Solutions’ Summon
(note that Serials Solutions is owned by ProQuest), Ex Libris’ PrimoCentral,
OCLC’s WorldCat Discovery Services (WDS), and, though it works differently
from the others, Innovative Interfaces’ Encore Synergy. AquaBrowser
is ProQuest’s discovery product aimed at the public library market.
The tricky part is that these companies are competitors both in the
individual database and now in the discovery service market. Their major
customers (large academic libraries) have resources from a wide variety of
vendors. Achieving the goal of providing “one search” access to all that con-
tent means that each discovery service company (A) must persuade the com-
peting discovery service companies (B), and all the other database vendors
(C), to give A access to their databases in order to harvest and pre-index
the data therein. This is a delicate dance, as you might imagine, but again,
the threat of Google is actually helping, and agreements are (carefully) be-
ing negotiated. From a customer’s point of view, it’s obvious: “You’ve got to
be in,” says Michael Kucsak, Director of Library Systems & Technology at
the University of North Florida, talking about inter-vendor discoverability.
“You’re in—you win. You’re out—you’re not long for the world” (Fry 2013).
Discovery services hold immense promise for breaking down the silos in
library content, especially the one between the library catalog (OPAC) and
the article databases (each one of which is in its own silo). While students
may eventually understand that the catalog and the databases are separate,
and that the routes for accessing each one are different, for the casual user
who needs 15 good articles for tomorrow’s paper—it is simply too much ef-
fort. Discovery services meet that need fairly efficiently and painlessly. And
as a librarian who has made many, many purchase decisions and is painfully
aware of what quality database resources cost, a “tool [that] holds the poten-
tial to significantly increase the discovery and use of such content” (Vaughan
2011) does indeed get my notice and my vote.
So how can a textbook on (individual) database searching still be justi-
fied? Why master all sorts of esoteric knowledge and get comfortable with
interfaces having three search boxes (with attendant options and settings)
when there is a simple, one-box option that searches the same material? The
discovery services tools are a wonderful way to woo undergraduates back to
library resources. But you have this textbook in hand, presumably, because
you are studying to become a librarian or an information professional or
technologist. For you, a higher order of knowledge and familiarity with more
sophisticated tools and approaches is one of the essential points—otherwise
anyone could set up shop and call herself an expert searcher. Google and
the discovery services will take care of the lower order questions. Someone
still needs to be there to deal with the harder, higher order research queries.
When the discovery service search isn’t providing the answer, someone needs
to know how to go to the next level: how to choose, access, and skillfully interact
with highly crafted, subject-specific databases on an individual basis. Jody
Fagan (2011) points out that “scholars working on more substantial research
projects . . . have already found—or will need to find—the native interface
to the subject-specific resources they need.” You need to be the person who
can point those scholars to subject-specific resources, and help them get the
most out of the “native interface” (which usually provides subject-specific fea-
tures) of those resources.3 This book is designed to do precisely that. Let’s get
started—because searching really can be just as rewarding as finding.
Notes
1. According to the Google Guide at http://www.googleguide.com/google_works.html,
the Google database is built by the GoogleBot, indexed by the Google Indexer, and
searches are handled by the three parts of the Query Processor. The utterly massive
scale simply precludes any kind of human involvement.
2. Worth noting, the World Bank databases, formerly subscription-based, are now avail-
able to the world for free. Kudos to the World Bank for this daring and generous move!
3. Besides, it’s just ever so much more interesting. What fun is plunking words in a
box? Trust me, database skills make research much more efficient and satisfying.
2
Database Structure for Everyone: Records, Fields, and Indexes
Whether you are using this book in an upper-level database searching
course or an entry-level intro to reference course, it’s likely that you have or
will have the opportunity to take a true “database” course. This means that
some of you may already be familiar with the concepts in this chapter. My
goal is to focus on helping you learn and develop strategies to search and
interact more effectively with library databases rather than getting into the
real technology of how databases are built. This chapter provides a brief and
simple introduction to how databases are conceptually put together. In my
experience, this is as much as you need to know to apply appropriate search
techniques and use the database effectively. There’s no point in piling on
technical detail if it doesn’t further your ultimate goal, which in this case is
searching.
Database Building Blocks
Fields, Records, and Tables
In essence, databases are made up of fields and records. Fields are like
one cell in an Excel spreadsheet: a bit of computer memory dedicated to hold-
ing one particular type of information, one value. For example, an age field
might hold the value 28. The type of information could be text, numbers, or
an image. A set of fields makes up a record, the idea being that the information
in all the fields of one record relates to one thing: a person, a company, a
journal, a purchase order, etc. An analogy would be a row in Excel: one row
equals one record. But while you could have an Excel file with 5000 rows
(records), and 30 columns (all the different fields), such a file wouldn’t ulti-
mately be very efficient to search, and definitely isn’t scalable (it would not,
actually, be a database, but only a “flat file”). Enter the idea of relational da-
tabases, which are structured with tables. It’s like having many Excel work-
sheets that can have indefinitely many rows, but only a few columns (fields).
One of the fields in every table is dedicated to a unique identifier, which ties
together all the material relating to the same person, company, etc.
All of that material now represents a record. The table structure (and some
additional features we will touch on presently) makes possible the desired
storage efficiency and speed of access even for huge amounts of information.
Think about driver’s licenses. They all have an ID number, the owner’s
name, address, date of birth, eye color, a bad photo, etc. All of that information
undoubtedly resides in a database administered by the state agency
that cares about driver’s licenses. It’s easy to imagine the Department of
Motor Vehicles’ database having fields with names such as ID #, Name, Addr,
DOB, Eyes, BadPic, etc. The fields are probably located in several tables: one
for address information, one for driving history, one for the photo, etc. Pulled
together by the ID number field, those fields make up records, each one of
which represents a person (Figure 2.1).
Figure 2.1. DMV relational database example: tables, fields, and complete record.

Personal Data Table
ID Number   Last Name   First Name   MI   DOB        Gender
12345678    Smith       John         Q    19451121   M
23456789    Jones       Martha       A    19950401   F
98765432    Kepler      John         T    19620714   M
[etc.]

Eye Table
ID Number   Eye color   Corrective Lenses
12345678    Blue        Y
23456789    Brown       N
98765432    Grey        Y
[etc.]

Address Table
ID Number   Street           City        State   Zip
12345678    123 Main St      Clyde       NY      14433
23456789    60 Merriman St   Rochester   NY      14607
98765432    238 Bayview Dr   Greece      NY      14612
[etc.]

Photo Table
ID Number   BadPic
12345678    [photo]
23456789    [photo]
98765432    [photo]
[etc.]

Driving History Table
ID Number   Years Driving   Accidents
12345678    53              2
23456789    3               1
98765432    34              0
[etc.]

“Show me the complete record for John T. Kepler”
ID #: 98765432; Last Name: Kepler; First Name: John; MI: T; DOB: 19620714; Gen.: M; Street: 238 Bayview Dr; City: Greece; State: NY; Zip: 14612; Eyes: Grey; Lenses: Y; Yrs Drv: 34; Acc.: 0; Pic: [photo]
The fields in the complete record represent every bit of information that
appears on your license, and probably some that isn’t actually printed on
the license as well. When you send in the paperwork and the check to renew
your license, they look you up in the database by your ID number, make any
changes that you might have indicated in your paperwork (e.g., change the
values in your fields), and hit print. Presto, you’ve gone from being a data-
base entry to being a small card with an unflattering photo.
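For readers who like to see the idea in action, here is a minimal sketch of tables tied together by an ID number, using Python's built-in sqlite3 module and a few values borrowed from Figure 2.1. The table names, column names, and the complete_record function are my own assumptions for illustration; a real DMV system would certainly look different.

```python
import sqlite3

# In-memory database with three small tables sharing an ID number field.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE personal (id_number TEXT PRIMARY KEY, last_name TEXT,
                       first_name TEXT, mi TEXT, dob TEXT, gender TEXT);
CREATE TABLE eye (id_number TEXT PRIMARY KEY, eye_color TEXT, lenses TEXT);
CREATE TABLE driving (id_number TEXT PRIMARY KEY, years_driving INTEGER,
                      accidents INTEGER);
INSERT INTO personal VALUES ('12345678', 'Smith', 'John', 'Q', '19451121', 'M');
INSERT INTO personal VALUES ('98765432', 'Kepler', 'John', 'T', '19620714', 'M');
INSERT INTO eye VALUES ('12345678', 'Blue', 'Y');
INSERT INTO eye VALUES ('98765432', 'Grey', 'Y');
INSERT INTO driving VALUES ('12345678', 53, 2);
INSERT INTO driving VALUES ('98765432', 34, 0);
""")


def complete_record(id_number):
    """Pull one person's fields from every table, tied together by ID number."""
    return conn.execute(
        """SELECT p.id_number, p.last_name, p.first_name, p.mi,
                  e.eye_color, e.lenses, d.years_driving, d.accidents
           FROM personal AS p
           JOIN eye AS e ON e.id_number = p.id_number
           JOIN driving AS d ON d.id_number = p.id_number
           WHERE p.id_number = ?""",
        (id_number,),
    ).fetchone()


print(complete_record("98765432"))
```

Each table holds only a few fields, and the shared id_number column is what lets the query reassemble them into one complete record.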
Decisions, Decisions: Designing the Database
From here on, I’m going to discuss databases only in terms of fields and
records, leaving the “tables” aspect out. In the real world, yes, what is behind
the interface you are looking at is almost undoubtedly a relational database,
built on tables. But those tables are simply fields that make up mini-records.
In essence, what matters are the fields and how many of them you need to
create a complete record.
And indeed, the crucial task in developing a database is deciding what
fields the records in your database are going to have, and how big they are
going to be, that is, how many characters or numbers they will be able to
hold. This “size” represents the computer memory allocated every time a
new record is added. (Although memory is cheap now, in a huge project,
how much memory will be allocated is still something to consider.) In the
best of all possible worlds, a whole design team, including software engi-
neers, subject experts, people from marketing and sales, and potential us-
ers, would wrestle with this problem. Nothing might ever get done in such
a large and varied group, however, and so probably a more limited team of
software engineers and content experts is the norm. The problem is that the
design team had better make good choices initially, because it can be dif-
ficult, if not impossible, to make significant changes to the record structure
later.1 This is good and bad. It means there’s a certain inherent stability, or
at least pressure on these database products not to change too much, but
when you wish that they would fix something, it can take a long time for
change to happen. You can take a certain amount of comfort, though, in the
knowledge that however much the interface to the database (the way it
looks) changes, behind the scenes the same types of information (fields)
are probably still there.
Food for Thought
For an article database, you’d probably have a field for the article title,
the name of the journal it appeared in . . . and what else? Think about the
other information you would want to capture. Again, the process is some-
thing like this:
Define your fields (Figure 2.2):
. . . that make up records (Figure 2.3) . . .
. . . and form the basis of your database.
Quick Recap
In this section we have described the structure of databases in very
simple terms and compared it to the structure of an Excel spreadsheet. The
most basic elements of a database are fields and records. (Technically, the
fields are usually structured in the form of tables, with one field in each
table acting as the “unique key” to pull all the information relating to one
record together.) A full set of fields makes up a record. Every record in the
database has the same set of fields (even if, in some records, some fields are
blank). All of the records together make up the database.
Beyond Fields and Records
Field Indexes
Fields and records are the basis, the “data” of a database. What makes a
database fast, powerful, and efficient are the indexes of the fields. It would be
very slow if every time you queried the database, it started at field1, record1,
and searched sequentially through each field of each record—you might as
well go back to hard copy at that rate.
Figure 2.2. Database fields.
Figure 2.3. Database records.
An index, in the sense that we’re discussing now, is a list of all the
values from a particular field, with some kind of identifier indicating from
which record each value came (a pointer if you will). This is much like the
way the index at the end of a book indicates on which pages a word appears.
In one sense, creating indexes of fields breaks one cardinal rule of databases:
not to duplicate any data. But this one kind of duplication is worth the re-
dundancy and extra storage space, because combined with sophisticated
algorithms indexes make it possible to locate and retrieve the records as-
sociated with specified values in nanoseconds. The field indexes become part
of the database but have a separate existence from the records. (You could
think of them as really minimal table structures: just two columns, one with
the values for field X, and the other containing a pointer back to the record
that each field X value came from.) Again, the power of an index is that it
can be sorted and otherwise optimized for searching.
Let’s return to the driver’s license example. It has a field for Last Name.
You’d definitely want to create an index to that field, so you’d have your com-
puter program harvest all the values from the Last Name field, along with
the associated ID Number value for each one. Given that the data is textual
(a name), you’d probably want to sort the index alphabetically.2 Then if you
wanted to find the record for Smith, John, your computer program could zip
to the Ss in the Last Name index list (and then to the Js in the First Name
index), find a set where the ID Numbers matched, and based on that pull up
the full record for Mr. Smith. And do this all in much less time than it takes
to write about it. Using indexes to find records also means that the order of
the rows in your database, that is, what order you enter your records, doesn’t
matter at all. You simply build an index and search that when you want to
find something in your database. Or you can build several indexes; you can
make an index of any field you want. However, as always, there are costs and
reasons why you might not index every field.3
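The Smith, John lookup just described can be sketched in a few lines of Python. This is a simplified illustration, not how any real system is built: the records dictionary and the functions build_index and lookup are hypothetical, and a plain binary search (Python's bisect module) stands in for whatever sophisticated algorithms a real database would use.

```python
from bisect import bisect_left

# Records keyed by ID Number; the order in which they were entered does not matter.
records = {
    "98765432": {"last": "Kepler", "first": "John"},
    "12345678": {"last": "Smith", "first": "John"},
    "23456789": {"last": "Jones", "first": "Martha"},
}


def build_index(records, field):
    """Return a sorted list of (value, ID Number) pairs for one field."""
    return sorted((rec[field], id_number) for id_number, rec in records.items())


def lookup(index, value):
    """Binary-search the sorted index and return the set of matching IDs."""
    ids = set()
    pos = bisect_left(index, (value, ""))
    while pos < len(index) and index[pos][0] == value:
        ids.add(index[pos][1])
        pos += 1
    return ids


last_name_index = build_index(records, "last")
first_name_index = build_index(records, "first")

# IDs that appear under both "Smith" and "John" identify the full record.
matches = lookup(last_name_index, "Smith") & lookup(first_name_index, "John")
print([records[id_number] for id_number in matches])
```

Notice that each index is just the two-column structure described earlier: a field value and a pointer (here, the ID Number) back to the record it came from.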
A Very Simple Example
Say we have three articles:
Milky Way’s Last Major Merger.
Science News. v. 162 no. 24 p. 376
It’s a Dog’s Life.
The Economist. December 21, 2002. p. 61
Manhattan Mayhem.
Smithsonian. v. 33 no. 9 p. 44
Let’s enhance these just a little by adding a one-line description to each re-
cord (so that we have a few more words to search on):
Record 1:
Milky Way’s Last Major Merger.
Science News. v. 162 no. 24 p. 376.
New clues about galaxy formation indicate early collision affected Milky
Way’s shape.
Record 2:
It’s a Dog’s Life.
The Economist. December 21, 2002. p. 61.
From hard labour to a beauty contest, a history of the work and whims of
dog breeding.
Record 3:
Manhattan Mayhem.
Smithsonian. v. 33 no. 9 p. 44
Martin Scorsese’s realistic portrayal of pre–Civil War strife—Gangs of New
York—re-creates the brutal street warfare waged between immigrant
groups.
My database will have just four fields (Figure 2.4):
1. Record number (four digits allocated; i.e., my database will never
grow to more than 9,999 articles)
2. Article title (50 characters allocated)
3. Journal name (50 characters allocated)
4. Abstract (200 characters allocated)
Now let’s index the fields.
The initial list of words from the Article Title field looks like this:
Milky
Way’s
Last
Major
Merger
It’s
a
Dog’s
Life
Manhattan
Mayhem
Figure 2.4. A very simple database record.
More Database Decisions
There are various things about this list that one might question. What
will our indexing program do with those possessives and contractions? Do
we want to clog it up with little words like a? There are many decisions for
database designers to make:
• How will the indexing program handle apostrophes and other
punctuation? We take it for granted now that the system will simply
preserve it, and users can search for contractions or possessives, but
you may still encounter systems that insert a space instead of the
apostrophe (dog s), or ignore it and treat the letters as a string
(ending up with “dogs” for “dog’s”).
• What will the indexing program do with the “little words”? That is,
words such as a, an, by, for, from, of, the, to, with, and so forth, which
are usually referred to as stop words. These are words that are so
common that database designers usually decide they don’t want to
expend time and space to index them. Indexing programs are
programmed with a list of such words and will “stop” indexing when they
hit a word on the list. A more descriptive term would be skip words,
because that is what really happens: the indexing program skips any
stop list words and continues to the next word. Almost all databases
employ a stop word list, and it can vary greatly from one vendor to the
next. (Even Google has stop words, words it doesn’t index.)
• Should the system be designed to preserve information about
capitalization, or to ignore the case of the words? We are used to
systems that do not distinguish upper and lowercase (so that you don’t
have to worry how you type in your query), but there are times when
you would really like the system to know the difference between, say,
AIDS (the disease) and aids (the common noun or verb).
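To make these three decisions concrete, here is a small Python sketch of a word splitter with a switch for each one. The function name, its parameters, and the policy labels are hypothetical; the stop word list is the one quoted above, and real indexing programs will differ in the details.

```python
import re

# Stop words taken from the examples listed above.
STOP_WORDS = {"a", "an", "by", "for", "from", "of", "the", "to", "with"}

def tokenize(text, apostrophes="preserve", fold_case=False):
    """Split text into index words under one combination of design decisions."""
    if apostrophes == "space":      # dog's -> dog s
        text = text.replace("'", " ")
    elif apostrophes == "strip":    # dog's -> dogs
        text = text.replace("'", "")
    words = re.findall(r"[A-Za-z']+", text)
    if fold_case:                   # AIDS and aids become the same entry
        words = [w.lower() for w in words]
    # Skip (rather than stop at) any word on the stop list.
    return [w for w in words if w.lower() not in STOP_WORDS]

preserved = tokenize("It's a Dog's Life.")
spaced = tokenize("It's a Dog's Life.", apostrophes="space")
stripped = tokenize("It's a Dog's Life.", apostrophes="strip")
folded = tokenize("AIDS aids", fold_case=True)
```

Running the same title through each policy shows how differently the same words can end up in an index, and therefore how differently a searcher may need to type them.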
Because this is a modern system, we’ll decide to preserve the
apostrophes and to make “a” one of our stop words, so it won’t be included in the
index. We can then sort the list alphabetically:
Dog’s
It’s
Last
Life
Major
Manhattan
Mayhem
Merger
Milky
Way’s
Can you see the problem here? We have neglected to include an
identifier to show which record a word came from. Let’s start over.
Better Field Indexing
Let’s make sure that our index list includes the record number and
which field the word came from:
0001 Milky TI
0001 Way’s TI
0001 Last TI
0001 Major TI
0001 Merger TI
0002 It’s TI
0002 Dog’s TI
0002 Life TI
0003 Manhattan TI
0003 Mayhem TI
One more thing: we can include a number representing the order of the
word within the field (why might this be useful?). We now have something
like this:
0001 Milky TI 01
0001 Way’s TI 02
0001 Last TI 03
Now we’ll sort again.
0002 Dog’s TI 03
0002 It’s TI 01
0001 Last TI 03
0002 Life TI 04
0001 Major TI 04
0003 Manhattan TI 01
0003 Mayhem TI 02
0001 Merger TI 05
0001 Milky TI 01
0001 Way’s TI 02
Note how even though we deleted the stop word “a” in the title “It’s a
Dog’s Life,” the numerical position of “Dog’s” reflects that there was an
intervening word there: its position is recorded as 3, not 2.
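The steps so far (record number, field code, word position, stop word, sort) can be sketched in Python as follows. The function name and the tuple layout are assumptions made for illustration; the titles, the field code TI, and the single stop word come from the example.

```python
titles = {
    "0001": "Milky Way's Last Major Merger",
    "0002": "It's a Dog's Life",
    "0003": "Manhattan Mayhem",
}
STOP_WORDS = {"a"}  # the one stop word chosen in the example

def index_field(values, field_code):
    """Return (word, record number, field code, position) entries, sorted by word."""
    entries = []
    for rec_no, text in values.items():
        for position, word in enumerate(text.split(), start=1):
            if word.lower() in STOP_WORDS:
                continue  # the word is skipped, but the position counter still advances
            entries.append((word, rec_no, field_code, position))
    return sorted(entries, key=lambda entry: entry[0].lower())

title_index = index_field(titles, "TI")
for word, rec_no, field_code, position in title_index:
    print(f"{rec_no} {word} {field_code} {position:02d}")
```

Because positions are counted before stop words are dropped, “Dog’s” keeps position 3, which is the behavior described above.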
Because people might want to search on the name of the publication,
it would be good to index that as well. Our index of the Journal Name field
looks something like this:
0002 Economist JN 02
0001 News JN 02
0001 Science JN 01
0001 Science News JN 01, 02
0003 Smithsonian JN 01
Note the multiple indexing of Science News. The technical term for this
is double posting.
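Double posting can be sketched the same way. In this illustration every multiword journal name is posted once per word and once more as a whole phrase. Two assumptions are made here: “The” is treated as a stop word (inferred from Economist being recorded at position 02), and the function name and entry layout are hypothetical.

```python
journal_names = {
    "0001": "Science News",
    "0002": "The Economist",
    "0003": "Smithsonian",
}
STOP_WORDS = {"the"}  # assumption: explains why Economist sits at position 02

def index_journal_names(names):
    """Post each word separately, then post multiword names again as a phrase."""
    entries = []
    for rec_no, name in names.items():
        kept = [(word, pos) for pos, word in enumerate(name.split(), start=1)
                if word.lower() not in STOP_WORDS]
        for word, pos in kept:
            entries.append((word, rec_no, "JN", [pos]))
        if len(kept) > 1:
            # The double post: the whole phrase, with every position it covers.
            phrase = " ".join(word for word, _ in kept)
            entries.append((phrase, rec_no, "JN", [pos for _, pos in kept]))
    return sorted(entries, key=lambda entry: entry[0].lower())

journal_index = index_journal_names(journal_names)
```

The extra phrase entry costs some space, but it lets a searcher find Science News directly as a name rather than as two unrelated words.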
To make things even faster and more efficient, after indexing each field,
combine the indexes so that you have only one list to search:
0002 Dog’s TI 03
0002 Economist JN 02
0002 It’s TI 01
0001 Last TI 03
0002 Life TI 04
0001 Major TI 04
0003 Manhattan TI 01
0003 Mayhem TI 02
0001 Merger TI 05
0001 Milky TI 01
0001 News JN 02
0001 Science JN 01
0001 Science News JN 01, 02
0003 Smithsonian JN 01
0001 Way’s TI 02
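As a rough Python sketch, here is how a program might search this one combined list, either across all fields or restricted to a single field, and how the stored word positions make it possible to test whether two words sit next to each other. The entries are transcribed from the list above; lowercasing the words, recording only the starting position of the phrase entry, and the function names are assumptions for illustration.

```python
from bisect import bisect_left

# (word, record number, field code, position), lowercased so lookups ignore case.
combined = sorted([
    ("dog's", "0002", "TI", 3), ("economist", "0002", "JN", 2),
    ("it's", "0002", "TI", 1), ("last", "0001", "TI", 3),
    ("life", "0002", "TI", 4), ("major", "0001", "TI", 4),
    ("manhattan", "0003", "TI", 1), ("mayhem", "0003", "TI", 2),
    ("merger", "0001", "TI", 5), ("milky", "0001", "TI", 1),
    ("news", "0001", "JN", 2), ("science", "0001", "JN", 1),
    ("science news", "0001", "JN", 1),  # phrase entry, starting position only
    ("smithsonian", "0003", "JN", 1), ("way's", "0001", "TI", 2),
])

def search(term, field=None):
    """Return (record, field, position) for each entry matching term."""
    key = term.lower()
    i = bisect_left(combined, (key,))  # jump straight to the first candidate
    hits = []
    while i < len(combined) and combined[i][0] == key:
        if field is None or combined[i][2] == field:
            hits.append(combined[i][1:])
        i += 1
    return hits

def adjacent(first, second, field):
    """Records where second immediately follows first within the given field."""
    starts = {(rec, pos) for rec, _, pos in search(first, field)}
    return sorted(rec for rec, _, pos in search(second, field)
                  if (rec, pos - 1) in starts)
```

For example, `adjacent("Milky", "Way's", "TI")` finds record 0001, while `adjacent("It's", "Dog's", "TI")` finds nothing, because the skipped stop word left a gap between positions 1 and 3.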
We undoubtedly want to index the content of the one-sentence
“abstracts,” as well. Here is a list of the words in raw form:
new          beauty       Pre-Civil
clues        contest      War
about        a            Strife
galaxy       history      Gangs
formation    of           Of
indicate     the          New
early        work         York
collision    and          Re-creates
affected     whims        The
Milky        of           brutal
Way’s        dog          street
Shape        breeding     warfare
From         Martin       waged
hard         Scorsese’s   between
labour       realistic    immigrant
to           portrayal    groups
a            of
Decisions and cleanup are needed on this list of words:
• Stop words—what will they be?
• Hyphenated words—how will they be recorded?
• Proper names—“double post” to include the phrase too?
• Alternative spellings—do we do anything about them or not? (What
might you do?)
Luckily, software does almost all of this work for us. You probably will
never see any indexes in their raw state. What we’ve been going over here
is, in real life, very much under the hood and often proprietary to the
database vendors. You don’t need to know exactly how any particular
database works; you simply need to grasp some of the basic principles that
govern how databases in general are put together and how they are indexed.
This determines how you search them, and what you can expect to get out of them.
Quick Recap
This section discussed the idea of field indexes and the importance
of good planning in the design of huge databases. Field indexes refer to
the idea that the values in a database’s fields can be extracted and put
into their own lists that consist of just the value and a pointer back to
the record it came from. These indexes exist separately from the records
in the database, and make rapid, efficient searching of huge databases
possible. Much thought goes into the initial database design (i.e., what
fields to include, what they are called, how much space to allocate for each
one), because the design cannot be easily changed later. Many decisions
go into the design of indexes as well, for example, which fields will be
indexed, how contractions and possessives will be handled, which words will
be treated as stop words, and if and how identification of phrases will be
supported.
Examples of Indexes in Common Databases
In the examples that follow, see if you can relate what we’ve just gone
over with how the field indexes are presented to you as a user in these
common databases. We’ll start with two multidisciplinary databases:
Academic OneFile from Gale, which has a single Subject list, and EBSCO’s
MasterFILE Premier, which offers separate Subjects, Places, and People
indexes. Last, we’ll consider the very elaborate indexing used by OCLC’s
WorldCat.
Gale’s Academic OneFile: One Subject List
Academic OneFile, one of the Gale Company’s “Infotrac” suite of
databases, prominently offers a Subject Guide Search. If you choose the Subject
Guide Search from the navigation bar, useful Search tips are displayed on
that interface page (Figure 2.5). The tips text suggests using this search
mode “when you want to browse a dynamic list of topics, people, products,
locations, organizations and more.”
Figure 2.5. Representation of the Academic OneFile Subject search interface.
Once you have searched for a term, you can browse forward
through the list, as long as there are headings containing your search term
somewhere within them. But the Gale system doesn’t offer unlimited,
free-form browsing capability, unlike MasterFILE’s true browse access (i.e.,
MasterFILE presents you with the very beginning of whichever index list you
choose, and you could, if you wanted, simply page through, or browse, the
whole thing without any searching at all). An advantage to the Gale subject
list over MasterFILE’s is that it is all-in-one: you don’t have to think about
the nature of what you’re looking for (Is it a subject? a person? a place?).
You can look once and know for sure whether the topic you’re looking for
is there or not. For example, Academic OneFile at the time of this writing
does not appear to have any articles on the singing group Chanticleer. The
only Subject Guide entry is for Chanticleer and the Fox, which is helpfully
glossed “(Novel).” Because this is an all-in-one list, these parenthetical
notes are very useful. Other examples include “(Planet),” “(Medication),”
“(Motion Picture),” the names of sports (to distinguish the different
“World Cup” events), and various others. In the Subject Terms results list,
links to Subdivisions and Related Subjects are provided if applicable, as
well as “See” entries to get you to the term Gale has decided to use (e.g.,
“Coffee addiction See Coffee habit”). If one is willing to slow down enough
to look through the list of Subdivisions, it is well worth it, as examining
the list can make finding articles on exactly the aspect of [topic x] you
need very easy and efficient. Looking at the Subdivisions for Coffee
(Beverage) provides an excellent example: just looking for how much coffee
is consumed? Try the Subdivision Consumption data. Environmental aspects,
Market share, Prices, Research, Risk factors, Statistics: the Gale indexers
have done an excellent job identifying
the kinds of things people look for most often and which can be hard to find
without the human intervention of applying intelligent subject headings.
The Gale subject list also includes the number of results for every heading
and subdivision, which is extremely helpful. Being able to see the count lets
you know that Academic OneFile is probably a good place to find articles
about the “Health aspects” of coffee (945 results in May 2014), but perhaps
not for learning more about “Diseases and pests” of coffee (only 4 results in
May 2014).
Figure 2.6. Initial Subjects Index interface in EBSCO’s MasterFILE Premier. ©
2014 EBSCO Industries, Inc. All rights reserved.
EBSCO MasterFILE Premier: Subjects, Places, & People Indexes
Even in a fairly simple display, there is a lot to look at and look for. In
this view of the subject index interface in EBSCO’s MasterFILE Premier
(Figure 2.6), A is the area identifying where we are: who is providing the
database, which database it is, and a search box to collect the results of our
choices from the Subjects, Places, and People indexes.
The section marked B tells us we are accessing the Subjects lists: there
are separate indexes for the Subjects, Places, and People fields. The interface
to these index lists allows us to simply start at the very beginning of the list
and browse forward, page by page, or to jump to any point in the index by
searching on a word or phrase, with the option of having our search term at
the beginning of the Subject entry (“Term begins with”), or anywhere within
it (“Term contains”). The third option, “Relevancy ranked,” will return all of
the Subject headings containing your search term arranged by relevance
rather than alphabetically (although it is hard to tell how “relevance” is
being determined).
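The difference between “Term begins with” and “Term contains” can be illustrated with a short Python sketch. The headings below are hypothetical stand-ins rather than actual MasterFILE entries, and the function names are invented. Nothing here reflects how EBSCO implements these options; it only shows why the two behave differently on an alphabetized list.

```python
from bisect import bisect_left

# Hypothetical subject headings, kept in alphabetical order like a browse list.
headings = sorted(
    ["COFFEE", "COFFEE -- History", "COFFEE growing", "COFFEEHOUSES",
     "DECAFFEINATED coffee", "TEA"],
    key=str.lower,
)
keys = [heading.lower() for heading in headings]

def term_begins_with(term):
    """Jump to the first heading starting with term, then read forward."""
    prefix = term.lower()
    results = []
    for i in range(bisect_left(keys, prefix), len(keys)):
        if not keys[i].startswith(prefix):
            break  # alphabetical order guarantees no later matches
        results.append(headings[i])
    return results

def term_contains(term):
    """Scan every heading, since the term may appear anywhere within it."""
    needle = term.lower()
    return [heading for heading, key in zip(headings, keys) if needle in key]
```

With these sample headings, `term_begins_with("coffee")` returns the four entries that start with the word, while `term_contains("coffee")` also picks up “DECAFFEINATED coffee.”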
Figure 2.7. The Subjects Index in MasterFILE Premier, showing the
beginning of the “coffee” entries. © 2014 EBSCO Industries, Inc. All rights
reserved.
In Figure 2.7 we see the results of searching the subject list for the
word coffee. The entries in all caps are values from the field designated as
subjects in this database: terms the EBSCO indexers have chosen from the
predetermined list of subject headings for this database, that they feel
capture the essence of the article’s content. We will talk more about this idea of
the “predetermined list” of subjects in chapter 3, but for now, just tuck away
the idea that entries in the Subjects list are not random: the indexers have
deliberately compiled this list of terms.4 Thus every article about the
history of coffee is assigned the subject: COFFEE—History. But, as indicated
by the helpful “Use” note, if you are looking for articles about the cultivation
of coffee, rather than “COFFEE—Cultivation” you should “Use COFFEE
growing.” The number of results for each entry is not provided, which is a
bit annoying.
The “Places” list contains, obviously, names of places that have been the
subject of articles in this database, helpfully glossed with the name of the
country or state where they are located to disambiguate them (e.g.,
“abbeville (ala.),” “abbeville (france),” and “abbeville (la.)” etc.). In the “People” list
you would find, obviously, names of people, but also of orchestras, musical
groups, and musical events, all glossed with the parenthetical note
“(performer)” (e.g., “boston early music festival (performer)”). The entire content
of the Places and People lists is lowercase, which seems a little odd, but in
both of these lists the number of records for each entry is provided, which is
very helpful.
Field Indexes for the WorldCat Database
Moving on to our third example, OCLC’s WorldCat database (a union
catalog of library holdings from around the world) provides even more
examples of the use of separate indexes for many fields. As in EBSCO’s
Subjects, Places, and People lists, the WorldCat Browse Index interface provides
the opportunity to roam around in the indexes, discovering what is there
(and thus, what is possible), before committing to a search. Some fields (such
as Author) are even indexed twice, in separate lists, creating one index for
single words only, and another for phrases. Figure 2.8 provides a drawing
of the initial view of the Browse Index interface (A), and an example of a
single-word and a phrase index for the same field. In the part of the
drawing marked B, the drop-down menu has been changed to Author, and in C to
Author Phrase. In the Author (single-word) index, you could browse only for
an author’s last name, for example, Austen. In the Author Phrase, you could
browse specifically for Austen, Jane.
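A minimal Python sketch of indexing one field twice might look like this. The author records other than Austen, Jane are hypothetical, and this is not a description of how OCLC builds its indexes. It simply shows one list keyed by single words and another keyed by the whole name as a phrase.

```python
import re
from collections import defaultdict

# Hypothetical author field values, keyed by record number.
authors = {1: "Austen, Jane", 2: "Austen, Henry", 3: "Dickens, Charles"}

word_index = defaultdict(set)    # one entry per single word
phrase_index = defaultdict(set)  # one entry per complete name

for rec_id, name in authors.items():
    phrase_index[name.lower()].add(rec_id)
    for word in re.findall(r"[a-z]+", name.lower()):
        word_index[word].add(rec_id)

by_word = word_index["austen"]            # every record with the word Austen
by_phrase = phrase_index["austen, jane"]  # only the record for Austen, Jane
```

The single-word list is the more forgiving of the two, while the phrase list is the more precise.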
You access the Browse Index screen via an icon in the WorldCat
Advanced Search interface, discussed in greater detail in chapter 7. For …