Sunday, October 20, 2013

6 things I am wondering about discovery

I blogged 8 things we know about web scale discovery systems in 2013, an attempt to summarize the current consensus after 4 years of web scale discovery service use in libraries and hundreds of research papers and presentations. (Not sure what they are?)

It was a post that seemed to be pretty widely shared and read, but drew no comments, from which I conclude I didn't really say much that was controversial.

This time, I am back to think aloud about issues where things are still up in the air. The same qualifiers I made in the last post apply: I am not an "expert", and my familiarity is mostly with Summon and to a lesser degree EDS etc.

Be warned it's a long ramble.

1. Are Blended results a good idea? Or should we implement Bento style search results?

The original incentive for discovery services was that users were telling us they wanted a "one-search" that put all our results, regardless of content type (particularly journal articles, which were not in catalogues), into a single search "like google", instead of going to a separate silo for each type of content.

As such it seemed obvious that the difficult part was to get all your content into a single index (those pesky content providers needed to agree), and presentation of results was a simple matter of using relevancy ranking techniques similar to what web search engines do to order them.

However, doubts have begun to surface about the wisdom of mixing up all the search results from different content types such as books, journal articles, newspaper articles etc together in one result list (the so called blended search model).

Some have pointed out that even Google has silos, for example the main Google search does not (usually; see later) mix in results from Google Books, Google Scholar or Google News.

So what is to be done? One approach that is becoming popular is to do a "bento style" results list, with results segregated by content type. Many university libraries around the world, such as Columbia University Library, NCSU Libraries, Princeton University Libraries, Dartmouth etc, are doing so.

A point of clarification: I consider both simple two column style interfaces with "Books & more" and "Articles" columns (Villanova University's VuFind implementation is probably the most well known example) and results with multiple boxes (say NCSU's example) to be examples of bento style.


As noted here, there are different ways to achieve this, from libraries loading the discovery index into open source systems such as VuFind, Blacklight or Xerxes, to those who create their own custom jumping off webpage.

Why are bento style interfaces better? (What follows is mostly analysis given here)

First, it eases the burden on the relevancy algorithm, as currently the relevancy system must decide how to rank items from totally different content types with totally different amounts of data to work with (e.g. some are full-text available, some metadata only).

Secondly, by including a "books" or "books & more/catalog" bento box, it caters for faculty and research staff who care only for those items. This also can indirectly solve the known item search issue for books.

Thirdly, there is mixed evidence on whether the blended style is what users want, according to a Bibliographic Wilderness blog post. My own personal experience handling user feedback on our Summon implementation here is that:

i) As already mentioned, discovery services occasionally make finding catalogue material harder. You can always filter for it, but that slows down the process.

ii) Less experienced users have a big problem trying to differentiate material types, and discovery services often don't highlight such differences clearly, so there are users who can't tell the difference between an ebook and an online book review, or have problems telling if a certain item is a newspaper article or a journal article. For discovery services like Summon that have a lot of newspaper article content, this can lead to less experienced users producing poor citations. This is one of (but not the only one of) the reasons this paper concluded that EDS users produced better citations for an assignment as judged by librarians.

Lastly, by implementing your own interface layer, which is what is required if you want a bento style interface, you are protected from future changes in discovery providers, as you can retain the same interface and just plug in results via the API.
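To make the "own interface layer" idea a bit more concrete, here is a minimal Python sketch of the grouping step only. It assumes the discovery API has already handed back a relevancy-ranked list of records; the field names (`title`, `content_type`) and the box labels are made up for illustration and are not taken from any vendor's API.

```python
from collections import defaultdict

# Hypothetical mapping from content types to bento boxes (labels are illustrative).
BOXES = {
    "book": "Books & more",
    "ebook": "Books & more",
    "journal article": "Articles",
    "newspaper article": "News",
}


def to_bento(results, per_box=3):
    """Split one ranked result list into boxes, keeping rank order inside each box."""
    boxes = defaultdict(list)
    for record in results:
        label = BOXES.get(record["content_type"], "Other")
        if len(boxes[label]) < per_box:
            boxes[label].append(record["title"])
    return dict(boxes)


# Made-up records standing in for whatever the discovery API returns.
sample = [
    {"title": "Discovery at scale", "content_type": "journal article"},
    {"title": "Library catalogues", "content_type": "book"},
    {"title": "Search in the news", "content_type": "newspaper article"},
    {"title": "Web scale indexes", "content_type": "journal article"},
]
bento = to_bento(sample)
```

Because the boxes are built from whatever list comes back, swapping discovery providers would in principle only mean changing the code that fetches `sample`, not the interface built on top of `bento`.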

The main issue here of course is that implementing such bento style systems requires quite a bit of work. It's unclear if the big 4 web scale discovery services will start to roll out native interfaces that offer bento style results, but Summon 2.0 is rolling out an alternative model, called content spotlighting, showing "grouped results" for some content types.

This draws inspiration, I believe, from how Google sometimes embeds results from Google News into a web search. The example below shows how a web search for "Nexus 7" gets you mostly web pages, but in the 3rd result you see results from Google News.

And of course Summon 2.0 has the same idea.

In the above example, you see a grouped result for newspaper articles, where Summon 2.0 highlights the top 3 ranked newspaper articles in the midst of other results, and if you click on "news results for...." it will just show newspaper articles.
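As a rough illustration of the grouped result idea, the Python sketch below collapses all hits of one content type into a single group placed where the first such hit was ranked. This is only my guess at the general shape of the behaviour, not how Summon 2.0 actually implements it, and the record fields are hypothetical.

```python
def spotlight(results, content_type="newspaper article", top_n=3):
    """Collapse hits of one content type into a single group at the first hit's rank."""
    grouped = [r["title"] for r in results if r["content_type"] == content_type][:top_n]
    output = []
    placed = False
    for record in results:
        if record["content_type"] != content_type:
            output.append(record["title"])
        elif not placed:
            # The first hit of the spotlighted type becomes the grouped entry.
            output.append({"group": content_type, "titles": grouped})
            placed = True
    return output


# Made-up ranked list: two non-news hits and four news hits.
ranked = [
    {"title": "A", "content_type": "journal article"},
    {"title": "N1", "content_type": "newspaper article"},
    {"title": "B", "content_type": "book"},
    {"title": "N2", "content_type": "newspaper article"},
    {"title": "N3", "content_type": "newspaper article"},
    {"title": "N4", "content_type": "newspaper article"},
]
page = spotlight(ranked)
```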

It has been announced this grouping will work for "Reference" material as well as possibly other content types in the future (I am hoping for "book reviews").

I would add that, this idea isn't new in Library systems, our version of Encore/Encore Synergy for example, is actually the reverse of Summon 2.0, where it shows catalogue results, with selected journal articles appearing in the middle of results.

2. How should we cater for advanced users who have needs for advanced features in discovery systems? Or should we even do so?

One of the greatest frustrations of librarians and advanced users with discovery systems, besides relevancy, is that they compare them with the features they have in databases and wonder why discovery systems lack them.

Some of this could be due to fundamental limitations in the discovery service. For example, one could never have a controlled vocabulary system like MeSH for all articles (with all the included benefits such as browsing capabilities, thesauri etc) in a discovery service because the items in a discovery service all come from different sources. Not without a huge amount of effort to do crosswalks anyway, and even that might not be possible.

Similarly one could never have extremely obscure granular filters/features of use for just a given discipline (e.g. a way to search for chemical reactions like SciFinder?).

On the other hand, some features are I think relatively trivial and could be added. For example, when I was at the Greater China Serials Solutions meeting, a librarian asked for a sort by times cited function similar to what is in Scopus etc.

On the practical level, I can see some possible obstacles to this feature, but the fundamental question is, should we include a feature set used by a very very small percentage of users?

Here's how I see it: because there is a limit to how many features you can squeeze into the discovery interface, I see 2 paths.

1. Put in a small set of most commonly used standard features from native databases  + unique features that only discovery services can have due to the scope, or to address specific issues of discovery.

2. Put in as many standard features you see in most other native interfaces

Path one would make discovery a unique tool in the toolkit of searchers that would complement native databases.

Path two would make the discovery search a closer substitute of native databases and reduce the need to use multiple tools.

Which path is better?

One wonders, though, what would happen if something like the SciVerse Application Framework, where developers can make plugins/gadgets for SciVerse Scopus and offer them for use, were implemented for discovery services.

Users could shop for and add the ones they want.

The above shows the SciVerse application gallery, where you can shop for apps you want to use with Scopus and ScienceDirect.

This allows a customization at the level of each user, so you wouldn't have a one size fits all interface. Institutions could set default applications/plugins but users would be free to turn on or off the ones they wanted.

There may be practical implications though in terms of reuse of data?

Undergraduates vs advanced users

Another dimension to this issue is, what is the target audience of discovery services?

Serials Solutions' line tends to be that Summon does not replace traditional databases (whether full text or A&I) and that there is always a place for them. The line is that Summon is good for undergraduates as a starting or jumping-off point to find good enough articles, but they should also go to databases for more serious research.

Ebscohost's EDS tries to differentiate itself by claiming their service is for everyone, not just undergraduates, fitting the needs of more advanced users as well.

In a page titled "Beyond undergraduate" - the page read "If a discovery service should truly encompass a university's entire collection, shouldn't it also cater to its entire user base? With an unparalleled user experience and the inclusion of important subject indexes used at the graduate and post-graduate level via Inclusion of Subject Indexes, EDS is poised to debunk the myth that discovery is merely an “undergraduate” resource."

I would say you can see this focus not just from the emphasis on subject indexes but also from the number of searching options built in to EDS. Of course this is because EDS is built off the existing Ebsco platform, but as a librarian who loves control, the EDS interface appeals to me.

So we basically have a difference of philosophy here, it seems. EDS is designed for advanced users, with a multitude of search features, while Summon is designed with a smaller but well-chosen set of features that the majority use.

As we shall see later in #3, in general EDS also pushes users to discover content even if the library does not have direct access to it, a feature that is good for advanced users but confusing to less advanced ones.

The implication of this is that EDS would likely capture a greater portion of advanced users compared to Summon, because Summon lacks some of the advanced search features such users are used to.

NOTE: EDS is essentially already a database platform (though it does include non-Ebsco content), which explains why it has more native database features in the first place.

Would that imply that the displacement effect, where more users search in the discovery service and no longer in the native databases, would be even stronger in EDS compared to Summon? We already see this impact for Summon, but what about EDS? More research is needed here.

I can't imagine a world where EDS or any discovery service has totally killed off use of native interfaces, because some do provide very unique value propositions (e.g. SciFinder, PubMed, Scopus), but I do imagine a philosophy that tends towards EDS's could potentially be more disruptive to native interfaces as compared to Summon's complementary-to-databases approach.

I would add that the earlier discussion of "standard features from native databases" vs "unique to discovery features" is probably independent of the target audience question as both approaches would be valuable to all types of users.

3. Should discovery services provide full text only  results or include results with abstract only by default? Should they only search the metadata or full-text by default?

Should discovery services show only items the library owns as a physical copy or has in full text (by default), or should they also show items that we may not have access to?

The argument for showing only items that are available in the library or in full text is that the heaviest users of such services are undergraduates, and they carry over the mindset from catalogues to library web scale discovery services (after all, both are default searches), where everything listed in the catalogue is something you can get.

So if you list items you can't get immediately (also some libraries don't offer or offer limited document delivery services or interlibrary loans to undergraduates), they get upset.

Summon itself by default mostly positions itself to show full-text only (either free or subscription matching your holdings), though there are exceptions where you will see "citation only" results.

For example, if you include abstracting and indexing databases (A&I databases), open access and free packages which are "all or nothing" ("zero title databases" in Serials Solutions speak) and other institutional repositories, you may see "citation only" results.

There are two situations where you may see citation online results:

a) Pre-login
b) Post login

Typically pre-login you won't see any citation online items, except in rare situations. Below shows one example.

The above shows an example of a "Citation online" result from the ERIC database in Summon pre-login. In general such examples are rare as most A&Is do not allow display of their data without authentication.

An additional note: many institutional repository packages, journal packages and free A&Is may be "all or nothing" (e.g. ProQuest Dissertations & Theses, Henry Stewart Talks), so the moment you turn one on, you get the whole set of results whether you have access to each item or not. There is no matching to your holdings.

Most A&Is (Web of Science, MLA etc) do not provide their metadata for free and do not allow display of their results to unauthenticated users, so in Summon the results from A&I databases are less problematic: users must log in before searching to see them, something most users won't do, so only experienced users who go looking for such results will find them.

Log in first before searching and you will see additional results, some of which are "citation online" only.

Once you have logged in, you will notice additional results appearing compared to the earlier search. Some are full-text items, some are not.

Due to the desire of many libraries to show only full-text even for A&I results, Serials Solutions introduced an "Exclude Citation Online Content" option, which would hide all citation online results. It's unclear to me how many Summon libraries have turned this on.

Ebsco Discovery Service has a different model when dealing with A&I results that require authentication. Even pre-login, the results will show that a result has matched a subscribed A&I but won't give any details, not even the title, and will encourage you to log in to see the citation.

As shown above, EDS will show results like #4, where you will have to log in to see the item from a restricted A&I source (most likely, but not always, something you have no full-text access to). This is unlike Summon, where you would be unaware such a result exists unless you logged in first.

Which model is better?

Even leaving aside this difference with regards to A&I, somewhat interestingly for EDS there is a limiter called "Available in Library Collection" (this means physical and full text online).

It's somewhat confusing what it does (explanation here), but I would guess it would roughly be the flipside of Summon's "Add results beyond your library's collection", which will show the whole index of Summon (less restricted A&I) items.

So if in EDS you do not have "Available in Library Collection" on, you will see unsubscribed content from, say, JSTOR appearing, I believe.

If we go by the premise that the default settings are the ones that you don't need to turn on, it would seem in EDS the recommended default is to show results even if they are not available in the library collection?

Of course, this might be a simple artifact of the existing Ebsco platform. In any case, some libraries like MIT Libraries currently do not have this on by default.

Overall though, it seems to me EDS libraries will usually tend to show more citation results to users than Summon libraries, and it's unclear to me which model is actually better. I guess it depends on the sophistication of the users you are dealing with. EDS may be positioning itself to deal with more advanced users.

Matching on metadata vs fulltext

A somewhat smaller issue I have been pondering is whether the search should by default match full-text + metadata or just metadata. While full-text searching is more powerful in theory, there are complaints that the relevancy ranking systems of discovery services are often not good enough and often surface irrelevant content because of some chance matching in the full text of books or articles.

I have seen EDS libraries with either option on or off by default. Libraries that do not have "Searched full text of articles" on are presumably saying they don't want to rely on full-text matching because of poor results, particularly for known item search.

In Summon, there is generally no option to restrict the search to just metadata, though there was a recent algorithm change that, by default, would restrict matching in full text to within 200 words, which can help combat the issue where the keywords appear pages apart and are totally irrelevant. Adding quotes (the sign of an "advanced user") would turn this function off. As of Oct 2013, this seems to have been removed.
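To illustrate what a "within 200 words" restriction could mean in practice, here is a small Python sketch of a proximity check. It assumes the rule is that every query term must occur inside some span of 200 consecutive words; that reading is my own assumption, and this is certainly not Summon's actual algorithm. The function name and the naive whitespace tokenisation are purely illustrative.

```python
def within_window(text, terms, window=200):
    """Return True if every term occurs inside some span of `window` consecutive words."""
    words = text.lower().split()  # naive tokenisation: punctuation is not stripped
    wanted = {t.lower() for t in terms}
    hits = [(i, w) for i, w in enumerate(words) if w in wanted]
    counts = {}
    left = 0
    for position, word in hits:
        counts[word] = counts.get(word, 0) + 1
        # Drop hits from the left until the span fits inside the window.
        while position - hits[left][0] >= window:
            old = hits[left][1]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            left += 1
        if len(counts) == len(wanted):
            return True
    return False


near = "discovery services can improve relevancy ranking"
far = "discovery " + "filler " * 300 + "relevancy"
```

With a check like this, `near` would count as a match for "discovery relevancy" while `far`, where the two words are about 300 words apart, would not.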

4. Will Discovery services lead to the decline of Abstracting and Indexing services?

"Are A & I Services in a Death Spiral?" considers only the impact of Google Scholar, without even considering the effect of discovery services, which only hastens the trend.

Particularly as a new generation of researchers grows up always having access to full-text, the idea of "abstract only" results is extremely alien. Even now, I get graduate students who express shock that there are results with no full-text. "What is the point of including them then?" they ask.

Currently more A&I content is being fed into discovery systems (something that I wouldn't have expected), with Scopus and Web of Science working with the main discovery services. In addition, Summon itself now supports ERIC, MLA and over 100 A&I databases.

EDS covers even more A&I content and boasts of "platform blending". It was explained to me by an EDS vendor that this was unique to EDS and the only way certain A&I content holders like APA would allow their content to be included in a discovery service. You might also want to see the following exchange of letters between Ebsco, Ex Libris and the Orbis Cascade Alliance saying the same thing.

So it seems the future of A&Is is secure, given that you can't cancel them without losing the content in discovery.

But the question is, how much of this content is actually unique? We no longer live in a world where you need to use an A&I to check if something exists. Most publishers of content are perfectly happy to push their metadata to as many places as possible, be it Google Scholar or discovery services.

In many cases, an unrestricted search of a discovery service's index ("search beyond library collection" in Summon) provides as good an index as the A&Is.

While it is true that a lot of this data is not retrospective and may miss out some of the more obscure content providers, as time passes this becomes a smaller problem as discovery services become ever more encompassing, bringing in even non-English content providers.

A&Is do hold an edge in better indexing, but it's an open question how much this helps, which brings us to the next issue...

5.  How much metadata is needed for good relevancy? Is "thin metadata + full text" sufficient?

This is an age-old debate, with EDS obviously claiming this is of greater importance due to their better store of index terms compared to rivals. It is however extremely difficult to measure the additional relevancy boost that results from this, so it's unlikely we will see this resolved.

As stated in  8 things we know about web scale discovery systems in 2013 the head to head tests are mixed.

A common idea floating around is that while Google can do world class relevancy ranking with mainly full-text and little indexing, they have advantages libraries can't match due to their willingness to track users and use signals that libraries can't and won't use.

6. Will Discovery services lead to the decline of OPAC? Or a new breath of life?

The traditional discovery service will harvest MARC records from the ILS and then display them in the discovery search results, but the moment you click on a result, it will direct you to the traditional catalogue.

Recently the thinking seems to be that this leads to poor usability as the user will suddenly be dropped into a totally different interface that can be jarring.

There seem to be three approaches to solving this.

a. Library discovery vendors who are already ILS vendors, e.g. Ex Libris, offer a combined product with Primo Central.

b. Library discovery vendors partner with ILS vendors, e.g. Ebsco Discovery Service partners with Innovative Interfaces' Encore and other ILSs.

c. Libraries use open source interfaces, e.g. VuFind, and pipe in discovery index results - basically a DIY version of b.

All three approaches are interesting in that they make the library catalogue the base and pipe in results from the discovery service index. You get back the familiar catalogue (hopefully next generation catalogue) interface, and you can do loan related transactions directly from the interface (e.g. place holds).

There is also no time delay where you catalogue a record in your ILS and it doesn't show up in your discovery service.

The interesting thing that occurs to me with this arrangement is, how would relevancy be done? Would we really be talking of one combined index of catalogue results and discovery results?

Presumably this would be the case in option a) where the ILS and discovery vendor are the same. But in cases of b) and c) where the API is used, it seems to me relevancy would be a bit more difficult.


This has been a long rambling post, hope it was of some value.

BTW, if you want to keep up with articles, blog posts, videos etc on web scale discovery, do consider subscribing to my custom magazine on Flipboard or looking at the bibliography on web scale discovery services.

Saturday, October 5, 2013

Mining acknowledgements, Library DIY & creative Information literacy

After several posts in a row about discovery services, let's have a change of pace and let me share with you some interesting ideas in the world of librarianship that I am playing with lately.

1. Measuring value of special collections by mining for thanks and acknowledgements in Google books. 

I've talked about how one could track thanks and acknowledgements of your library or librarians on social media and put them together via Storify.

But what about more formal acknowledgement of thanks from Scholars?

The idea here is simple, use Google books and search through for books where Scholars acknowledge your library or librarians for assistance rendered.

I had toyed with the same idea in the past, but Chris Bourg and Jacque Hettel of Stanford University Library have implemented the idea far more thoroughly than I could have imagined.

Jacque Hettel has generously shared the procedure in two blog posts so far.

1. A method for measuring "Thanks" Part 1 : A search for thankful candidates  

2. A method for measuring "Thanks" Part 2 : Scraping Query Results For Analysis In a Collaborative Project

The first blog post is fairly obvious, you search for acknowledgements mentioning your library. The blog post gives the sample string used by Stanford University Library. I easily modified it to work with my University.

The search query used was

(~thank | ~acknowledge) & ("NUS" | "National University of Singapore" | "NUS Library" | "NUS Libraries") & ("Singapore Malaysia collection" | "Singapore-Malaysia collection" | ~library)
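If you want to adapt the query above to your own institution without hand-editing brackets and quotes, a small Python helper along these lines might do. The helper names are hypothetical, and the `~`, `|` and `&` operators are simply copied from the query above; the sketch only builds the string and does not check how Google Books interprets those operators.

```python
def any_of(terms):
    """Join alternatives into one bracketed OR group."""
    return "(" + " | ".join(terms) + ")"


def quoted(phrases):
    """Wrap each phrase in straight double quotes for exact matching."""
    return ['"{}"'.format(p) for p in phrases]


def thanks_query(institution_names, collection_names):
    """Build an acknowledgement search string in the same shape as the query above."""
    thanks = any_of(["~thank", "~acknowledge"])
    institution = any_of(quoted(institution_names))
    collection = any_of(quoted(collection_names) + ["~library"])
    return " & ".join([thanks, institution, collection])


query = thanks_query(
    ["NUS", "National University of Singapore"],
    ["Singapore Malaysia collection"],
)
```

Adding names of librarians, as suggested below for cutting false positives, would just mean passing more phrases into one of the lists.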

There were quite a few false positives and looking at some of the results, I realised I could refine it further by adding a few more search terms like names of librarians but this was good for a start.

Personally, I think the main contribution here is in the 2nd post, on how to scrape the data from Google Books.

This post generously shares how to extract the data, even if you have no technical knowledge by installing and using a Chrome extension. I managed to follow the instructions in a couple of minutes and pulled the data into Google docs and later Excel.

As recommended in the blog post, you probably want to show 100 results per page to speed up scraping page by page. You probably also want to refine the search by date ranges to reduce the results to something more manageable.

Chris Bourg and Jacque Hettel have clearly done a lot of thinking and work on this, and will be giving a talk at the 2013 DLF Forum in Nov. So any comments and thoughts I have on this idea are likely to be pretty superficial.

But I will just share a few wild thoughts and observations.

Firstly, perhaps it is the fault of my search string, but you still have to spend quite a lot of time looking through the hits extracted. In some cases, it seems the preview snippet doesn't show the key words matched for various reasons (snippet view, or a preview version where the section is not shown for preview), so you have to throw it out, or go check the print or online copy directly if you have it.

Secondly, besides extracting those results, what could one do with them? I was thinking one could take the book titles extracted, somehow match them to book covers and do a "Books that would have been poorer if not for our library" online exhibition?

Thirdly, could this be extended to other sources? How about HathiTrust (it overlaps with Google Books and also has no preview)? The other obvious case here would be to try for acknowledgements in theses and dissertations. Most libraries including ours have their own ETDs (electronic theses and dissertations) in their institutional repositories, so that would be an easy win.

Beyond that one could try sources like ProQuest Dissertations & Theses or other full-text sources to find acknowledgements from beyond your institution.

On a broad level, this is an excellent idea to showcase the value of our librarians and collections, though one wonders if this might lead to another arms race, with libraries or librarians trying to get acknowledged if god forbid this becomes a way of evaluating performance.

2. Library DIY - a choose your own adventure type knowledge base

The always excellent Meredith Farkas blogged about her new project Library DIY a few months back

So what is it?

"The content in Library DIY is designed to mirror a reference desk transaction more so than an instruction session. Much like in a reference interview (where we elicit more specific information about the student’s need), students can drill down to just the piece of information they need rather than having to skim through a long tutorial to find what they want."

You can try it live yourself at  but essentially, it allows one to quickly drill down to the information you need, say for example if you select

"I need to find sources for my research",  you are asked "what type of sources do you need".

If you selected "I am looking for articles" to that question you would see the following

Eventually though you would of course terminate at a page with answers.
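For anyone thinking of building something similar, the "choose your own adventure" structure boils down to a small decision tree. Here is a minimal Python sketch; the two choice labels are the ones quoted above, while the top-level question, the answer text and the function name are placeholders I made up, and the real Library DIY tree goes several levels deeper than this.

```python
# A tiny decision tree: each node is either another question with choices or a final answer.
TREE = {
    "question": "What do you need help with?",  # hypothetical top-level question
    "choices": {
        "I need to find sources for my research": {
            "question": "What type of sources do you need?",
            "choices": {
                "I am looking for articles": {
                    "answer": "Placeholder answer page about finding articles.",
                },
            },
        },
    },
}


def walk(tree, selections):
    """Follow the user's selections and return the next question or the final answer."""
    node = tree
    for choice in selections:
        node = node["choices"][choice]
    return node.get("answer", node.get("question"))


next_step = walk(TREE, ["I need to find sources for my research"])
final_page = walk(
    TREE,
    ["I need to find sources for my research", "I am looking for articles"],
)
```

The same nested structure could be typed up as linked pages in whatever content system you have to hand, since each node is just one page with a few links.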

It's a really interesting and different take on how to present FAQs and instructional help guides, relying on browsing rather than searching.

Reaction from librarians has been good, but as Meredith herself wisely commented, librarians are a different breed from our users, so we need to see if users like this as well. Or would they just prefer to search for answers?

One thought that strikes me is, we can use this "choose your own adventure" way of browsing for answers not just for students but also for internal use aka for our own librarians. Could we use this to train our own librarians? Organise our own complicated internal knowledgebase?

I believe that the site above is powered by Drupal. We don't have this here, so I am experimenting with trying to duplicate this with LibGuides and Confluence wiki, which we have access to.

3. Reading Walsh, Andrew and Coonan, Emma's Only Connect … Discovery pathways, library explorations, and the information adventure.

Unfortunately, I've always found books or articles on Information Literacy pretty dull, but this book put together by Andrew Walsh and Emma Coonan reads anything but that.

Available as a free e-book under a Creative Commons licence it's really worth a read.

It's one of the most creative books I have ever read; each chapter, by a different author, is unique and quirky. For example, there is one that is presented as a prezi, and there is creative use of metaphor - one chapter uses the playing-card metaphor, another talks about the "Fish scale of academicness" - using the "trawling for information" idea, another uses the research quest as the "Hero journey" conceit, with librarians, supervisors as wise old companions etc.

Really brilliant if you haven't seen it yet, with series of Youtube videos + simple quiz.

Part one of 9 videos with the idea of research as a hero's quest 

Parts of the book probably just go over my head, but still a very good read if you are jaded about information literacy and are looking for new ideas.


At first glance all these projects seem to be totally unconnected. The first, mining for acknowledgements, is an attempt to show our value to the powers that be, while the second is an attempt to directly help our users, especially those who like to "do it yourself". The third is perhaps at least partly for the librarian himself or herself, for self improvement.

Still, I feel all these types of projects have value as we need to simultaneously do the following

i) do the right things - do things that our users will value. This is a hard problem, it's a problem that vexes even the biggest and most successful commercial companies, most of the time it isn't obvious what is worth doing and you can't just play this by numbers. Libraries should perhaps spend more time on this.

ii) do things right - once we have identified what we should be doing, we should ensure it is done efficiently and effectively, and often this involves improving back-end processes, protocols, teaching methods etc. Libraries have traditionally been very good at this because we tend to be detail-oriented, like working with facts and figures, and work hard to optimize these values.

iii) communicate our value - sadly even if you do the above two, it is easy to be taken for granted, so we need to be on the lookout for ways to communicate our value or we will be overlooked. This is probably the hardest for us as a profession to do, because librarians, I would say, tend to be quite self-effacing.

A right balance of all 3 types of projects will help libraries thrive of course.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.