Monday, December 7, 2015

Measuring the value of contributed metadata in web scale discovery services

One of the more interesting issues around the rise of web scale discovery services like Summon, Primo, EBSCO Discovery Service and WorldCat Local is the place of abstracting and indexing (A&I) databases like Scopus, Web of Science or more discipline-specific databases like PsycINFO.

Publishers of full text like Sage, Springer and IEEE eventually realized it was to their benefit to contribute metadata to the index of web scale discovery services, because it increased the findability of their full text to users on discovery services and hence increased demand for their content (IEEE going so far as to study obstacles to getting their content indexed in discovery services). It was less clear why abstracting and indexing (A&I) databases should contribute their metadata to the discovery index.

For example, let's say a user searches in a discovery service like Primo and finds the following record.

As you can see above this record is contributed by the A&I database Web of Science.

The user then clicks on View Online to see where to get the full text.

As seen above, the user can click on either of the targets (ProQuest or DOAJ) to get the full text from those sites. (The links are generated using an OpenURL resolver.)

A&I services are left out in the cold

Let's recap the transaction.

The user is happy because he gets access to items he would have otherwise missed. Similarly, the discovery service (Ex Libris's Primo, with Ex Libris soon to be under ProQuest) gains from making more items discoverable.

The actual content provider of the item (in the above case ProQuest or DOAJ) is happy too: its content gets discovered, and usage of that content will go up and be recorded.

The only one left out from this happy transaction is the A&I database vendor - Web of Science. As the user never actually goes into the A&I database, he may not even realize he just benefited from the library's subscription to the A&I database.

Usage of the A&I database may in fact fall, as some libraries have reported, particularly if users are aware, or dimly grasp, that the same records in the A&I database can be found in the discovery service.

This is an issue that is well recognized by NISO's Open Discovery Initiative (ODI). Of course, most A&I databases require that the library subscribe to both the A&I database and the discovery service before it can benefit from the metadata, so if the library values the metadata provided by the A&I database, it will continue to subscribe.

But herein lies the rub: how do you know the metadata from the A&I database is making the difference in helping discovery, particularly since many full-text providers are also giving away their metadata? Sure, the A&I database may have more or better metadata, but how do you know it is making the difference?

Measuring the value of metadata/records contributed 

Until recently, I wasn't aware of any way to measure the value of the metadata contributed by a source (A&I, publisher, aggregator, etc.). However, while playing around with Ex Libris' Alma and Primo analytics and lurking on the mailing list, I noticed an interesting email by a UNSW librarian regarding the "Link resolver usage" subject area in Alma analytics.

Here's part of the message:

"If the source has a colon in it, a user either was a staff member  testing the link within Alma, or got access to an article from within a database by being referred back to the uresolver to see if you have a subscription that covers it."

The first part is fairly straightforward, so you will see sources listed such as:

EBSCO:Business Source Complete  - 220 requests

ProQ:ABI/INFORM Global - 110 requests

info:sid/ - 55 requests

Elsevier:Scopus - 20 requests

Here we are talking about link resolver requests (typically branded Findit@xlibrary) from these databases. In the above example, we have link resolver requests from Business Source Complete, ProQuest ABI/INFORM Global, Web of Science on the Web of Knowledge platform, and Scopus.

So the above shows users searching in Scopus, and when they click on Find it @ SMU Library, the clicks will be recorded with the source Elsevier:Scopus.

This is pretty standard fare if you are familiar with link resolvers.

The part that left me quite excited was this:

"If the source has an underscore or is just some letters eg “wj” (Wiley journals) then the user got access to the article from a PCI record in Primo". Note : PCI = Primo Central Index, the name of the discovery service index.

If I read this correctly it means not only can we see link resolver requests from databases and the discovery service, we can actually see which source contributed the record that appeared in the Primo discovery service! 

So in the above statistics we can see that there were 4,666 clicks on records in the discovery service Primo with metadata from Scopus (scopus). Similarly, we can see 9,362 clicks on records in Primo with metadata from Wiley (wj) and 11,268 from Web of Science (wos).

So going back to the above example, when a user clicks on a record that originated in Web of Science but was found in Primo, the click will be recorded as coming from the source "wos".
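To make the rule from the email concrete, here is a minimal sketch in Python. It assumes the source IDs have already been exported from Alma analytics into a plain mapping of source to request count. The function names and the dictionary layout are my own for illustration and are not part of Alma, and the counts simply reuse the two sets of example figures quoted in this post.

```python
def classify_source(source_id: str) -> str:
    """Apply the colon rule from the quoted email to one source ID."""
    # Assumption: a colon means the request came from a database
    # (or a staff test in Alma); anything else is a Primo Central record.
    if ":" in source_id:
        return "database"
    return "primo_central"


def totals_by_kind(counts: dict) -> dict:
    """Sum request counts per kind of source."""
    totals = {}
    for source_id, n in counts.items():
        kind = classify_source(source_id)
        totals[kind] = totals.get(kind, 0) + n
    return totals


# Example figures quoted in this post, combined purely for illustration.
requests = {
    "EBSCO:Business Source Complete": 220,
    "ProQ:ABI/INFORM Global": 110,
    "info:sid/": 55,
    "Elsevier:Scopus": 20,
    "scopus": 4666,
    "wj": 9362,
    "wos": 11268,
}

totals = totals_by_kind(requests)
print(totals)
```

The same split could just as easily be done with a filter in the analytics tool itself; the point is only that the colon is enough to separate the two kinds of request.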

So it seems at least with Primo, we can now measure the value of the metadata provided by different sources!

In fact, in my preliminary tests for my institution, when counting clicks on records in the Primo discovery service, Web of Science came in 3rd among sources of records/metadata. So it's pretty important.

Dealing with multiple versions of the same article/item

Discovery services of course often get more than one version of the same item from various sources. For any given article, they may get metadata from aggregators, full text providers, A&I databases or other sources.

Primo handles this by attempting to match and group records it "thinks" are the same article and initially displaying only one in the search results. (Anyone know if this can be adjusted, or is covered in the Primo technical manuals?) In the example below, the record from JSTOR is the main record showing.

I haven't tested this, but I assume that if you click on "view online" without clicking on "view all versions", only the source of the displayed record (in the above case JSTOR) will be credited.

Of course, you can click on "View all versions" to see other versions. This is very similar to how Google Scholar works.

Of course, while these records are very similar, they do differ in small ways because they come from different sources. In my example, the record from MEDLINE/PubMed has slightly better subject headings. If I search with these subject headings, the MEDLINE/PubMed record appears as the main record in the search results, as the query no longer matches the record from JSTOR.

So far this makes a lot of sense, though there might be some squabbling over which source to "credit" discovery to if the search query matches more than one possible record.

E.g., if I search for a title combined with a subject and records from two possible sources are matched, should I credit discovery to both equally?
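One possible answer to that question is to split each click equally among the sources whose records matched. The sketch below is purely hypothetical: I have no evidence that Alma or Primo does anything like this, and the click data and function name are made up for illustration.

```python
from collections import defaultdict


def credit_equally(clicks):
    """Split one unit of credit per click among all matching sources."""
    credit = defaultdict(float)
    for matched_sources in clicks:
        if not matched_sources:
            continue  # nothing matched, so nobody gets credit
        share = 1.0 / len(matched_sources)
        for source in matched_sources:
            credit[source] += share
    return dict(credit)


# Hypothetical clicks: each entry lists the sources whose record matched the query.
clicks = [
    ["jstor"],             # title-only search matched the JSTOR record
    ["jstor", "medline"],  # title + subject search matched both records
    ["medline"],           # subject-heading search matched MEDLINE only
]

credit = credit_equally(clicks)
print(credit)
```

Whether an equal split is fair is exactly the kind of thing vendors might squabble over; weighting by which fields matched would be another option.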

Grouping multiple records vs Single super record approach

The problem is that some discovery systems like ProQuest's Summon practice a "super record" approach.

"The index includes very rich metadata brought together from multiple sources using an innovative match-and-merge function. Match-and-merge allows Summon to take the best data from these sources and bring it together to complete a rich “super record.”" - source

While this sounds like what Primo is doing, it's actually quite different. In Primo, while the system groups different versions of the same article, each version's record is still retained separately, as you can see from clicking on "view all versions".

In Summon, if multiple versions of the same article are available from multiple sources, a "match-and-merge" function will try to build a single merged/deduped "super record".

The super-record might include

a) Title/author/author supplied keywords from the publisher A and aggregator B
b) Subject headings from multiple A&I databases, e.g. Scopus and PubMed
c) Table of contents from aggregator C

and so on.

I can see the attraction of such an approach, and from a user's point of view it's probably cleaner: the user doesn't care which source contributes to the item getting discovered, and all he wants to see is one "super record" with all the available data combined on it.

See for example the same article record in Summon below

Above you can see just a single record and the sources used to create the "super record" are listed. Under subjects you can see the combined entries drawn from various sources.

Because it's a single super record, you also increase the chances of discovery. For example, suppose a person happens to be searching for the following three together in an advanced search:

a) Title
b) Subject from Source A
c) eISSN from Source B

It will match Summon's super record but not any of Primo's individual records because no single record has all 3 items.
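The difference can be illustrated with a toy model in Python. This is only a sketch of the general idea of merging versus grouping, not how Summon's match-and-merge or Primo's grouping is actually implemented. The records, field names and values are hypothetical.

```python
def merge_records(records):
    """Union the field values of several source records into one 'super record'."""
    merged = {}
    for record in records:
        for field, values in record.items():
            merged.setdefault(field, set()).update(values)
    return merged


def matches(record, query):
    """True only if every queried field value is present in the record."""
    return all(value in record.get(field, set()) for field, value in query.items())


# Hypothetical records for the same article from two sources.
source_a = {"title": {"example article"}, "subject": {"cardiology"}}
source_b = {"title": {"example article"}, "eissn": {"0000-0000"}}

# Advanced search combining title, a subject from Source A and an eISSN from Source B.
query = {"title": "example article", "subject": "cardiology", "eissn": "0000-0000"}

# Grouping approach: each record is searched separately.
grouped_hit = any(matches(r, query) for r in (source_a, source_b))

# Super record approach: the merged record is searched.
merged_hit = matches(merge_records([source_a, source_b]), query)

print(grouped_hit, merged_hit)
```

In this toy case the merged record matches while neither individual record does, which is the discovery gain described above. It also shows the cost: once merged, the record no longer says which source supplied the field that matched.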

But a combined super record presumably means that it's going to be harder to do the same as what Primo did. Since it's all one record, when an article is found in Summon, how do you know which source contributed to the discovery?

Of course, it's not impossible that Summon could still retain individual source records, similar to Primo, and use them to give credit to sources for aiding discovery...


I'll end here with a statement from an ODI paper.

"A&I services have a complex set of issues relative to the ecosystem of index-based discovery. The producers of these services naturally have an interest in preserving their value, especially in being assured that libraries will continue to maintain their subscriptions should they contribute to discovery services.

Decisions regarding whether to participate in discovery services are not straightforward. Discovery services not tuned to make use of the specialized vocabularies, abstracts, and other mechanisms inherent in the native A&I product may underexpose the resource. Aggregators that license A&I content and fulltext resources from other providers may not have the rights to further distribute that content.

Discovery services must limit access to proprietary content, such as abstracts and specialized vocabularies to authenticated users affiliated with mutually subscribing institutions. Given these factors, among many others, A&I resources must be treated with special consideration within the body of content providers that potentially contribute to discovery services."

The ability to credit discovery to particular sources as found in Alma analytics goes some way in helping encourage more A&I services to contribute content to discovery services as their value can now be tracked.

Still this development does not completely solve all the concerns of A&I services.

For example, there is concern that relevancy algorithms in some discovery services may systematically under-privilege content contributed by A&Is (for example by weighting full text more than subject headings), leading to a devaluing of their content. See for example the exchange of letters between EBSCO, Ex Libris and Orbis Cascade Alliance.

In fact, the ability to track contributions to discovery from sources could backfire and lead libraries to undervalue A&I sources, now that they can finally see the impact of metadata contributed by them.

This is a hard nut to crack. While one could come up with metrics that measure the % of top X results that contain A&I sources (e.g. the latest Primo Analytics provide something along those lines for results from Primo/Primo Central/EBSCO API), it's still not possible to agree on what % is reasonable, as there is no gold standard for relevancy algorithms to compare against.
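As a rough illustration of such a metric, here is a minimal Python sketch that computes the percentage of the top X results whose record came from an A&I source. The result list, the "source" field and the function name are all assumptions for illustration; this is not the Primo Analytics calculation.

```python
def share_from_sources(results, sources, top_x=10):
    """Percentage of the top X results whose source is in the given set."""
    top = results[:top_x]
    if not top:
        return 0.0
    hits = sum(1 for result in top if result["source"] in sources)
    return 100.0 * hits / len(top)


# Hypothetical ranked result list for one query.
results = [
    {"source": "wos"},
    {"source": "wj"},
    {"source": "scopus"},
    {"source": "jstor"},
    {"source": "wj"},
]

ai_sources = {"wos", "scopus"}  # assumed set of A&I source IDs
ai_share = share_from_sources(results, ai_sources, top_x=5)
print(ai_share)
```

Computing the number is the easy part. Deciding what number counts as fair treatment of A&I content is where the lack of a gold standard bites.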

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.