Thursday, October 15, 2015

Of full text , thick metadata , and discovery search

As my institution recently switched to Primo, nowadays I lurk in the Primo mailing list. I am amused to note that in many ways the conversation on it is very similar to what I experienced when lurking in the Summon mailing list. (One wonders if in time to come this difference might become moot but I digress).

Why don't the number of results make sense?

A common thread that occurs on such mailing lists from time to time and that often draws tons of responses is a game I call "Do the number of results make sense?".

Typically this would begin with some librarian or (technical person tasked to support librarians) bemoaning the fact that they (or their librarians) find that the number of results shown are not "logical".

For example someone would post a email with a subject like "Results doesn't make sense". The email would look like this (examples are made up).

a) Happy birthday    4,894
b) Happy birth*    3,623                                      
c) Happy holidays  20,591
d) Happy holid*    8,455
e) Happy OR birthday 4,323                                    

The email would then point out that it made no sense that number of results in b) and d) were lower than in a) and c) respectively. Or that e) Should have more results than a).

Other variants would include using quotes, or finding that after login (which usually produces more results due to results appearing from mutually licensed content providers) the number of results actually fell etc.

The "reason" often emerges that the web scale discovery service whether Summon Or Primo is doing something "clever" that isn't transparent to the user that results in a search that isn't strictly boolean logic.

In the past, I've seen cases such as

* Summon doing stemming by default but dropping it when boolean operators was used (might have changed now)
* Primo doing metadata search only by default but expanding to matching full text if the number of results dropping below a certain number.

I've discussed in the past How is Google different from traditional Library OPACs & databases?  and in this way web scale discovery services are somewhat similar to Google in that they don't do strict boolean and can do various adjustments to try to "help the user" at the cost of predictability and often transparency if the user wasn't given warning.

Matching full text or not?

In the most recent case I encountered in the Primo mailing list, it was announced there would be a enhancement to add a displayed message indicating that the search was expanded to match full text.

This lead to a discussion on why Primo couldn't simply match on full text all the time, or at least provide a option to do either like how EBSCO Discovery Services does.

MIT Libraries's Ebsco Discovery services searches in full text by default but you can turn it off.

An argument often made is that metadata match only, improves relevancy , in particular known item searching which makes up generally about 40-60% of searches.

For sure this makes relevancy ranking much easier since not bothering to consider matches in full text means the balancing act between ranking matches in full text vs metadata can be avoided.

In addition, unlike Google or Google Scholar, the discovery service index is extremely diverse including some content that is available in metadata only formats while others includes full text or are non text items (eg DVDs, videos).

Even if the items contain full text, they range from length in terms of a single page or paragraph to thousands of pages (for a book).

Not needing to consider this difference makes relevancy ranking much easier.

Metadata thick vs thin

Still a metadata match only approach ignores potentially useful information for full text and it's still not equally "fair", because content with "Thick metadata" still has a advantage over "Thin metadata".

I am not familiar with either term until Ebsco began to talk about it. See abstract below.

Of course "other discovery services" here refer mainly to Proquest's Summon (and Exlibris's Primo), which has roughly the same articles in the index but because they obtain the metadata directly from the publisher have limited metadata basically , article title, author, author supplied keywords etc.

While thick metadata would generally have controlled vocabulary, table of contents etc

The 4 types of content in a discovery index

So when we think about it, we can classify content in a discovery service index along 2 dimensions

a) Full text vs Non-full text
b) Thick metadata vs Thin metadata

Some examples of the type of content in the 4 quadrants

A) Thick Metadata, No Full text - eg. Abstracting & Indexing (A&I) databases like Scopus, Web of Science, APA Psycinfo etc, MARC records

B) Thick Metadata, Full text - eg. Ebsco databases in Ebsco Discovery Service, combined super-records in Summon that include metadata from A&I databases like Scopus and full text from publishers

C) Thin metadata, No Full text - eg Publisher provided metadata with no full text, Online video collections, Institutional repository records?

D) Thin metadata, Full text - eg Many publisher provided content to Summon/Primo etc.

What are the different ways the discovery service could do ranking?

Type I - Use metadata only - Primo approach (does expand to full text match if number of results falls below a threshold)

Type II - Use metadata and full text - Summon approach

Type III - Use full text mostly plus limited metadata - Google Scholar approach?

Type IV - User selects either Type I or II as an option - Ebsco Discovery Service approach

The Primo approach of mainly using metadata (and occasionally matching full text only if number of results are below a certain threshold) as I said privileges content that has thick metadata (Class A and B) over thin metadata (Class C and D) but is neutral with regards on whether full text is provided.

Still compare this with a approach like Summon that uses both metadata and full text. Here full text becomes important regardless of whether you have thin metadata or thick metadata it helps to have full text as well.

All things equal would a record that has thick metadata but no full text (Class A) rank higher than one that has thin metadata but has full text? (Class D).

It's hard to say depending on the algorithm used to weight full text vs metadata fields,I could see it going either way. Depends on the way things are weighted I can see it going either way.

My own past experience with Summon seem to show that there are times where full text matches seem to dominate metadata. For example searching for Singapore AND a topic, can sometimes yield me plenty of generic books on Singapore that barely mention the topic over more specific items. I always attributed it to the overwhelming match of the word "Singapore" in such items.

The fear that the mass of full text overrides metadata is the reason why some A&I providers are generally reluctant to be included their content in discovery services. This is worsened by the fact that currently there is no way to measure the additional benefit A&I's bring to the discovery experience, as their metadata once contributed will appear alongside other lower quality metatdata in the discovery service results.

If by chance the library has access to full-text via Open URL resolution, users will just be sent to the full text provider while the metadata contributed by the A&I database that contributed to the discovery of the item in the first place is not recognised and the A&I is bypassed. This is one of the points acknowledged in the Open Discovery Initative reports and may be addressed in the future.

In fact implementation of discovery services can indeed lead to a fall in usage of A&I databases in their native interfaces as most users no longer need to go directly to the native UI. Add the threat from Google Scholar, you can understand why A&I providers are so wary.

I would add that this fear that discovery services (except for Ebsco which already host content from A&Is like APA's PsychInfo) will not properly rank metadata from A&Is is not a theoretical one.

Ebsco in the famous exchange between Orbis Cascade alliance and Exlibris,  claims that

As you are likely aware, leading subject indexes such as PsycINFO, CAB Abstracts, Inspec, Proquest indexes, RILM Abstracts of Music Literature, and the overwhelming majority of others, do not provide their metadata for inclusion in Primo Central. Similarly, though we offer most of these databases via EBSCOhost, we do not have the rights to provide their metadata to Ex Libris. Our understanding is that these providers are concerned that the relevancy ranking algorithm in Primo Central does not take advantage of the value added elements of their products and thus would result in lower usage of their databases and a diminished user experience for researchers. They are also concerned that, if end users are led to believe that their database is available via Primo Central, they won't search the database directly and thus the database use will diminish even further.

Interestingly, Ebsco discovery service itself splits the difference between Primo and Summon and allows librarians to set the default of whether to include matching in full text or metadata only but allows users to override the default.

From my understanding default metadata only search in EDS libraries is pretty popular because many librarians feel metadata only searching provides more relevant results.

I find this curious because EBSCO is on record for stating that their relevancy ranking places the highest priority on their subject headings rather than title, as they are justly proud of the subject headings they have.

One could speculate EBSCO of all discovery services would weigh metadata more than full text, but librarians still feel relevancy can be improved by ignoring full text!

Content Neutrality?

With the merger of Proquest and Exlibris , we are now down to one "content neutral" discovery service.

One of the fears I've often heard is that librarians fear Ebscohost would "push up" their own content in their discovery service and to some extent people fear the same might occur in Summon (and now Exlibris) for Proquest items.

Personally, I am skeptical of this view (though I wouldn't be surprised if I am wrong either).  but I do note that for discovery vendors that are not content neutral, it's natural that their own content will have at the very least full text if not thick metadata while other content from other sources is likely to have poor quality metadata and possibly no full text unless efforts are taken to obtain them.

This itself would lead to their own content floating to the top even without any other evil doing.

To be frank, I don't see a way to "Equalize" everything , unless one ignores full text and also only ranks on a very limited set of thin metadata that every content has.

Ignoring metadata and going full text mostly?

Lastly while there are discovery services that rank based on metadata but ignore full text, it's possible but strange to think of a Type of search that is the exact opposite.

Basically such a system ranks only or mostly on full text and not on metadata (whether thick or thin)

The closest analogy I can think of for this is Google or Google Scholar.

All in all, Google Scholar I guess is a mix of mostly full text and thin metadata so this helps make relevancy ranking easier since we are ranking across similar types of content.

Somehow though Google Scholar still manages to do okay.... though as I mentioned before in
5 things Google Scholar does better than your library discovery service has a big advantage as

"Google Scholar serves one particular use case very well - the need to locate recent articles and to provide a comprehensive search." compared to the various roles library discovery services are expected to play including known item search of non-article material.


Honestly, the idea that libraries would want to throw well available data such as full text to achieve better relevancy ranking is a very odd one to me. 

That said we librarians also carefully curate the collections that are searchable in our discovery index rather than just adding everything available or free , so this idea of not using everything is not a new concept I guess.

blog comments powered by Disqus

Share this!

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Related Posts Plugin for WordPress, Blogger...