Sunday, March 23, 2014

Is known item searching *really* an issue for web scale discovery?

Since I began looking at web scale discovery in 2009-2010, I've read many librarians comment on how known item searching is harder in web scale discovery, and it's not just the rank-and-file librarians.

In the latest Ithaka S+R US Library Survey 2013, in the section on discovery, for the question "To what extent do you think that your index-based discovery service has made your users' discovery experience better or worse in each of the following areas?", Library Directors felt that "Helping users find items they already know about" was discovery's weakest area (Figure 35).

On a personal note, when we implemented Summon at my own institution, some of the most negative feedback we received was from graduate students and faculty, who lamented that many of the items they used to look for in the catalogue were now hard to find.

Hence it was with great interest that I noticed the following tweet by Dave Pattern of Huddersfield University Library, a known library innovator and an early adopter of Summon.
I believe he was reacting to earlier tweets coming out of ER&L 2014, where a presenter claimed to have improved results by tweaking the ranking to improve, among other things, known item search. What followed was a wide-ranging discussion on Twitter, with many librarians and technologists working on discovery systems giving their two cents' worth.

Some claimed they had heard of such complaints but could never get a credible, non-contrived example, and most examples that surfaced were due to spelling errors. This group felt it could be more of a perception issue.

A few others felt it was a real problem at first, but that the issue has improved over the years.

Yet others (a smaller group) felt it was an important issue.

I myself am of the view that it is an issue that has gotten better with time, though problems remain. And yes, often the user complaining just remembers the one time out of a hundred when the known item search fails to bring up the item, but it is still frustrating for a fully tenured professor to suddenly fail to find a simple item when previously they could.

It's somewhat difficult to generalise though because some of us commenting are using Primo Central, others Summon etc.

Even within Summon implementations, results can vary, as I have often found by comparing "fails" here with those at other Summon libraries:

 1) Types of packages switched on. For example, if you turn on HathiTrust or newspaper database packages (and you generally can't tweak the algorithms), known item searching of catalogue results gets worse due to a "crowding out" effect.

 2) Cataloging

That said, the question remains: how bad is the known item search issue? Even someone who is skeptical of known item search issues will probably concede it will happen, because there are many more results to sort through.

You can see this is the main reason for the problem, because most problems disappear the moment you refine to "items in library catalogue" in Summon.

Currently in our instance of Summon, typing Freakonomics (part of the title of a popular book commonly known as such, and a frequent course reading) gets you only journal articles, newspaper articles and book reviews: anything but the book.

But refining to Library Catalog gets you the item.

The book Freakonomics is found only by restricting to items in the Library Catalog

I agree that discovery systems have harder jobs than OPACs, but that is cold comfort to someone who used to be able to find a known item with one search in the OPAC.

Admittedly, as the point person for complaints about discovery services here, such issues loom large in my mind. Randomly looking through search logs in Google Analytics also helps me notice issues, though in reality the issue may not be that big.

There have been attempts to quantify this difficulty.

Most recent was Emily Singley's Discovery systems – testing known item searching where she tested 8 libraries using the 4 major discovery services.

The test is interesting in that it tried five types of queries:
  • Single word titles (e.g. 1984)
  • Titles with “stop words” (e.g. To have and have not)
  • Title/author keyword (e.g. Smith and On beauty)
  • Book citation (copied from bibliographies)
  • ISBN  
The results showed that WorldCat Local (name change to WorldCat Discovery Service coming?) came out on top. Google was slightly behind, followed by Summon, Primo Central and EDS.

Though interesting for comparison, the main issue, as pointed out in the comments, was that the test set was not drawn from real-world examples. Of course, Emily herself admits the test is "cursory".

Some libraries have done more specific tests like testing the top 1,000 most frequent known item search queries in logs to show their discovery service performs almost as well as the traditional OPAC. In my institution, we did the same for journal title/name searches, databases and books before launch. This helped a lot, but the long tail of searches means users will still run into issues in many cases.
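
For those curious what such a log-based benchmark might look like in practice, here is a minimal sketch in Python. The `run_search` callable is a hypothetical stand-in for whatever your discovery layer's search client returns (no vendor API is assumed); the matching rule, that the expected title appears somewhere in the top 10 result titles, mirrors the "found in the first 10" criterion used in tests like these.

```python
def found_in_top_n(expected_title, ranked_titles, n=10):
    """True if the expected record's title appears in the first n results."""
    target = expected_title.lower()
    return any(target in title.lower() for title in ranked_titles[:n])


def benchmark(log_entries, run_search, n=10):
    """log_entries: (query, expected_title) pairs mined from search logs.

    run_search(query) should return a ranked list of result titles
    (a stand-in for your discovery service's search client).
    Returns the fraction of queries where the known item was found.
    """
    hits = sum(
        found_in_top_n(expected, run_search(query), n)
        for query, expected in log_entries
    )
    return hits / len(log_entries) if log_entries else 0.0
```

Running this over the top few hundred or thousand logged known-item queries before and after launch gives a crude but repeatable success rate to compare against the old OPAC.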

Fear of issues with known item is not without precedent

In fact, this fear of known item search becoming harder has precedent before the current era of web scale discovery.

When libraries moved towards keyword searching as the default via "next generation catalogues" like AquaBrowser, Encore and Primo, there was a fear that known item searching would become harder compared to title browse.

I remember, as a newbie librarian sitting in a committee, worrying that keyword search would make known item search harder.

Was this fear borne out?

Known item searching - keyword searching vs title browse - a systematic test

Perhaps it's instructive to study this example from the University of Minnesota Libraries, where they systematically studied the effects of switching between

i) MNCAT Classic - Aleph (traditional catalogue, where title browse is typically the default)

ii) MNCAT - Primo (basically a next generation catalogue with keyword searching but no article index)

iii) MNCAT Discovery - Primo Central (same as ii but including the article index)

H/T: found via a comment on Emily Singley's Discovery systems – testing known item searching blog post.


As explained in the very informative video above, they randomly selected 400 items from search logs from their traditional OPAC to create benchmarks for MNCAT Classic (OPAC), MNCAT (Primo) and eventually MNCAT Discovery (Primo Central).

These 400 may include items that the library did not have.

MNCAT Classic was tested with "Title begins", i.e. title browse.


MNCAT was tested with keyword search.

If the item appeared in the first 10 results, or in the "Did you mean" suggestions for MNCAT, it was considered found.

The results showed that 90% of results were the same (66% appeared in both, 24% in neither).

8% of the time, MNCAT Classic found the item but MNCAT did not, and 2% of the time the reverse happened.

The video goes on to study the differences in results.

What's the bottom line?

Technically the classic catalogue won. 98% of the time, the classic catalogue handled known item searches correctly, while the next generation catalogue with keyword searching worked correctly 92% of the time (assuming that when neither search finds the item, both are counted as working correctly).
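
As a quick sanity check on those figures, the four cells of the comparison add up as follows (a small Python sketch; the "neither found counts as correct" assumption is the one stated above):

```python
# The 400 sampled queries form a 2x2 table of percentages.
both, neither = 66, 24             # found by both systems / found by neither
classic_only, keyword_only = 8, 2  # found by only one of the two systems

# "Neither finds it" is treated as correct for both systems.
classic_correct = both + classic_only + neither   # title browse (MNCAT Classic)
keyword_correct = both + keyword_only + neither   # keyword search (MNCAT/Primo)

print(classic_correct, keyword_correct)  # 98 92
```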

Is this difference significant? I would argue not.

Our own experience shifting to keyword searching in III's Encore, a next generation catalogue, also backs up this finding: keyword searching is generally as capable as title browse for finding known items.

A lot depends on how the algorithm ranks items, of course (III's Encore algorithm is very well tuned for known item searching, matching title fields with highest priority), but it seems to me that since both traditional OPACs and next generation OPACs match only on traditional MARC records and not on articles, it's still relatively easy to get known item searches right.

What happens when you add an article index?

It will be very interesting to see the University of Minnesota Libraries results when they benchmark against MNCAT Discovery (Primo Central).

I will guess that known item search would be significantly worse (maybe 85%? particularly if we see author + title combos) without lots of customization, because the challenge of sorting through all the newspaper and journal content is now much harder.

One key to reducing this issue is the "Did you mean...." function. It's relatively easy to do this for journal title searches, as some Primo libraries have done, but it needs to be done for books as well.

A "Did you mean" that could recommend popular textbooks based on circulation, presence in reading lists and other metrics could help, as suggested by Dave Pattern.
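
That idea could be sketched roughly as follows. This is a hedged illustration, not Dave Pattern's actual implementation: the signals (circulation counts, reading-list appearances) come from the suggestion above, but the field names and weights are my own illustrative assumptions.

```python
def popularity_score(record, w_circ=1.0, w_list=5.0):
    """Weighted popularity of a catalogue record.

    record: dict with illustrative 'circulation' and 'reading_lists' counts.
    Reading-list presence is weighted more heavily than raw circulation,
    an assumed weighting, on the theory that course readings drive
    known-item searches.
    """
    return (w_circ * record.get("circulation", 0)
            + w_list * record.get("reading_lists", 0))


def did_you_mean(candidates, top_k=3):
    """Return the top_k most popular candidate titles for the suggestion box."""
    ranked = sorted(candidates, key=popularity_score, reverse=True)
    return [r["title"] for r in ranked[:top_k]]
```

In practice the candidate list would come from a fuzzy title match against the catalogue, with this scoring used only to break ties in favour of the books users most likely mean.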

There are other ideas too, not least of which is the bento style...


It's pretty obvious that web scale discovery systems will have tradeoffs, and one of them is slightly less effective known item searching.

The question that isn't answered is: how big is the trade-off? The answer varies from audience to audience. My suspicion is that the popularity of bento style, and/or the refusal to load catalogue data into discovery at some of the highly ranked Ivy Leagues/ARLs, suggests that known item search can be a serious enough issue for some audiences to switch away from a "blended" style of results.

NCSU Libraries - Bento Style

The more graduate students and faculty you have, the greater the likelihood they will be doing known item searches that aren't on the typical reading lists or "did you mean" checklists that could help.

Granted, a lot of the searches they do can be challenging even for a traditional catalogue (looking for a particular edition of a common work, for example), but web scale discovery makes them nearly impossible.

So what do you think? Do you think known item searching issue in web scale discovery is over-blown?

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.