Sunday, July 27, 2014

Size of Google Scholar vs other indexes, personally tuned discovery layers & other discovery news

Regular readers of my blog know that I am interested in discovery, and the role academic libraries should play in promoting discovery for our patrons.

If you feel the same, here is a mix of links on the topic I came across recently that might be of interest.

The Number of papers in Google Scholar is estimated to be about 100 million

When talking about discovery, one can't avoid discussing Google Scholar. My last blog post, 8 surprising things I learnt about Google Scholar, raced into my top 20 most-read blog posts in just 3 weeks, showing the intense interest in this subject.

As such, The Number of Scholarly Documents on the Public Web is a fascinating paper that attempts to estimate the number of scholarly documents on the public web using the capture/recapture method; in particular, it gives you a figure for the number of papers in Google Scholar.

This is quite an achievement, since Google refuses to release this information.

It took me a while to wrap my head around the idea, but essentially the paper:

  • Defines the number of scholarly documents on the web as the number of papers in either Google Scholar (GS) or Microsoft Academic Search (MAS)
  • Takes the stated number of papers in MAS to be a bit below 50 million
  • Measures the overlap between the papers found in GS and MAS; this overlap has to be estimated via sampling, of course
  • Measures the overlap using the papers that cite 150 selected papers
  • Using the Lincoln–Petersen method, the measured overlap, and the given value of about 50 million papers in MAS, estimates the number of papers in Google Scholar and hence the total number of papers on the public web. (You may have to take some time to understand this last step; it took me a while for sure.)
There are other technicalities: for example, the paper estimates only English-language papers, and it is careful to sample papers with fewer than 1,000 cites (because GS shows at most 1,000 results per query).
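The Lincoln–Petersen logic above can be sketched in a few lines of Python. All the numbers below are made up for illustration; they are not the paper's actual samples or results.

```python
def lincoln_petersen(sample_a, sample_b):
    """Capture/recapture estimate: if two independent samples drawn from a
    population overlap in m items, the population size is roughly
    |A| * |B| / m."""
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        raise ValueError("samples must overlap for the estimate to work")
    return len(sample_a) * len(sample_b) / overlap

# Pretend these are the papers citing some paper p, as returned by each
# engine (hypothetical IDs, purely illustrative):
citing_in_gs = [f"paper{i}" for i in range(0, 80)]     # 80 citing papers in GS
citing_in_mas = [f"paper{i}" for i in range(40, 100)]  # 60 citing papers in MAS

# The overlap is papers 40-79 (40 papers), so the estimated total number
# of papers citing p is 80 * 60 / 40 = 120.
total = lincoln_petersen(citing_in_gs, citing_in_mas)
print(total)  # 120.0
```

Repeating this over many seed papers, and anchoring the result to the known size of MAS, is roughly how the paper backs out an estimate for Google Scholar.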

For more, see also How many academic documents are visible and freely available on the Web?, which summarises the paper and assesses the strengths and weaknesses of its methodology.

The major results are 

  1. Google Scholar is estimated to have 99.3 million English-language papers, and in total there are about 114 million papers on the web (where the web is defined as Google Scholar + MAS)
  2. Roughly 24% of papers are free online
These figures are meant to be lower bounds, but they are still interesting as they provide an estimate of the size of Google Scholar. Is 99.3 million a lot?

Here are some comparable systems and the index sizes I am aware of as of July 2014. Scopes may differ slightly, but I will focus mostly on scholarly or peer-reviewed articles, which make up the bulk of most indexes anyway. I did not adjust to include English-language articles only, though many of these systems do allow filtering for that.
  • PubMed - 20-30 million - the go-to source for the medical and life sciences.
  • Scopus - 53 million - mostly articles/conference proceedings, but now includes some books and book chapters. This is one of the biggest traditional library A&I databases; its main competitor, Web of Science, is at roughly the same level but with more historical data and fewer titles indexed.
  • BASE - 62 million - drawn from open access institutional repositories. Mostly but not 100% open access items, and may include non-article items.
  • CrossRef Metadata Search - 67 million - indexed DOIs; may include books or book chapters.
So far these are around the level of Microsoft Academic Search at about 50 million.

Are there indexes comparable to Google Scholar's roughly 100 million? Basically, the library web-scale discovery services are the only ones at that level.

  • Summon - 108 million - with the "Scholarly material" facet on + "Add beyond library collection" + authenticated = including restricted A&I records from Scopus, Web of Science and more. (Your instance of Summon might have more or less depending on the A&I indexes subscribed to and the size of your catalogue and institutional repositories.)
  • WorldCat - 2.1 billion holdings, of which 148 million are peer reviewed and 203 million are articles [as of Nov 2013]
I am unable to get figures for the other two major library web-scale discovery services, EBSCO Discovery Service and Primo Central, but I figure they should be at roughly the same level.



108 million scholarly materials in Summon - may vary for your Summon instance



  • Mendeley - 181 million? This is an interesting case: Mendeley used to list the number of papers in its search but has removed it. The last figure I could find is 181 million (from the Wayback Machine), which fits with some statements made online but looks a bit high to me.

The figures I've given above, with the exception of Mendeley's, are I would think pretty accurate (subject to deduping issues etc.), at least compared to the estimates given in the paper.

I think the fact that web-scale discovery services are producing indexes at the same scale (>100 million) suggests the estimated Google Scholar figure is in the right ballpark.

Still, my subjective experience is that Google Scholar tends to retrieve substantially more than our library web-scale discovery service, so I suspect the 99.3 million figure obtained for Google Scholar is an underestimate.

I wonder if one could use the same methodology as in The Number of Scholarly Documents on the Public Web to estimate the size of Google Scholar but using Summon or one of the other indexes mentioned above to measure overlap instead of Microsoft Academic Search.

There are some advantages to this.

For example, there is some concern that the size of Microsoft Academic Search, assumed in the paper to be 48.7 million, is not accurate, whereas the figures given for, say, Summon are likely to be more accurate (again, deduping issues aside).

It would also be interesting to see how Google Scholar fares when compared against an index about twice as large as MAS.

Would using a web scale library discovery service to estimate the size of Google Scholar give a similar figure of about 100 million? 

Arguably not, since we are talking about different populations, i.e. MAS + GS vs Summon + GS, though both can be seen as rough estimates of the amount of scholarly material in the world that can be discovered online. (Also, can the results you find in Summon be considered the "public web" if you need to authenticate before searching to see the subset of results from A&I databases like Scopus?)

The main issue, though, with trying to use Summon or anything similar in place of MAS is, I think, a technical one.

The methodology measures overlap in a way that has been described as "novel and brilliant": instead of running the same query on the two search engines and looking for overlaps, they do it this way:

"If we collect the set of papers citing p from both Google Scholar and MAS, then the overlap between these two is an estimate of the overlap between the two search engines." 

Unfortunately, none of the web-scale discovery services has a cited-by feature (they do draw on and display Scopus and Web of Science citation counts, but that's a different matter).

One can fall back on older methodologies and measure overlap by running the same query on GS and Summon, but this has drawbacks, described as "bias and dependence" issues.

Boolean versus ranked retrieval - clarified thoughts

My last blog post, Why Nested Boolean search statements may not work as well as they did, was pretty popular, but what I didn't realise was that I was implicitly saying that relevance ranking of documents retrieved using Boolean operators did not generally work well.

This was pointed out by Jonas 



I tweeted back asking why we couldn't have good ranked retrieval on documents retrieved using Boolean operators, and he replied that he thinks it's based on two different mindsets, and one should either "trust relevance or created limited sets."

On the opposite end, Dave Pattern of Huddersfield reminded me that Summon's relevancy ranking is based on the open source Lucene software with some amount of tweaking. You can find some details, but essentially it is designed to combine Boolean with vector space models etc.; in other words, it is designed for, and can do, Boolean + ranked retrieval.

After reading through some documentation and the excellent Boolean versus ranked querying for biomedical systematic reviews, I realized my thinking on this topic was somewhat unclear.

As a librarian, I had always assumed it made perfect sense to (1) pull out possibly relevant articles using Boolean operators, then (2) rank them using various techniques, from classic tf-idf factors to more modern techniques like link popularity.

I knew, of course, that there were two paradigms: classic Boolean set retrieval assumes every result is "relevant" and does not bother with ranking beyond sorting by date etc. But it still seemed odd to me not at least to try to add ranking. What's the harm, right?

The flip side was: what is ranked retrieval by itself? If one entered SINGAPORE HISTORICAL BUILDINGS ARCHITECTURE, it would still be ranking only documents that had all four terms, right (maybe with stemming)? Wasn't that really still Boolean with ranking?

The key point I was missing, which now seems obvious, is that in the ranked retrieval paradigm not every search term in the query has to be matched.

I know readers knowledgeable in information retrieval might find this obvious and think me dense for not realizing it. I guess I did know it, except that as a librarian I am so trapped in Boolean thinking that I assume an implicit AND is the rule.

In fact, we like to talk about how Google and some web search engines do a "Soft AND", and we kick up a fuss when they sometimes drop one or more search terms. But in ranked retrieval that's what you do: you throw in a "bag of words" (it could be a whole paragraph), the ranking algorithm tries to do the best it can, and the documents it pulls up may not contain all the words in the query.
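To make the contrast concrete, here is a toy sketch (not any real engine's algorithm) of Boolean AND retrieval versus bag-of-words ranked retrieval with a simple tf-idf score. Note how the ranked version still surfaces a document that is missing one of the query terms.

```python
import math
from collections import Counter

docs = {
    "d1": "singapore historical buildings and architecture guide",
    "d2": "historical architecture of singapore",  # missing "buildings"
    "d3": "modern buildings in tokyo",
}
query = "singapore historical buildings architecture".split()

def boolean_and(query, docs):
    """Classic Boolean: only documents containing ALL query terms survive."""
    return [d for d, text in docs.items()
            if all(t in text.split() for t in query)]

def tf_idf_rank(query, docs):
    """Rank every document by a simple tf-idf score; no term is mandatory."""
    n = len(docs)
    # df = number of documents each term appears in
    df = Counter(t for text in docs.values() for t in set(text.split()))
    def score(text):
        tf = Counter(text.split())
        return sum(tf[t] * math.log(n / df[t]) for t in query if df[t])
    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)

print(boolean_and(query, docs))  # only d1 matches all four terms
print(tf_idf_rank(query, docs))  # d2 still ranks second despite a missing term
```

Boolean AND silently discards d2 entirely; the ranked retrieval places it just below the full match, which is exactly the "Soft AND" behaviour described above.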

Boolean versus ranked querying for biomedical systematic reviews is a particularly interesting paper, showing how different search approaches, ranging from straight Boolean to ranked retrieval techniques that involve throwing in titles and abstracts, as well as hybrid techniques that combine Boolean with ranked retrieval, fare in terms of retrieving clinical studies for systematic reviews.

It's an amazing paper, with various metrics and a good explanation of systematic reviews if you are unfamiliar with them. Particularly interesting: they compare Boolean with Lucene results, which I think gives you a hint of how Summon might fare.

The best algorithm for ranking might surprise you.... 



Read the full paper to understand the table! 


Large search indexes like Google Scholar and discovery services flatten knowledge, but is that a good thing?

Like many librarians, I have an obsession with the size of databases, but is size really that important?

Over at Library Babel Fish, Barbara Fister, in The Library isn't Flat, worries that academic libraries' discovery services "are (once again) putting too high a value on volume of information and too little on curation".

She ends with the following questions:

"Is there some other way that libraries could enable discovery that is less flat, that helps make the communities of inquiry and the connections between ideas easier to follow? Is there a way to help people who want to join those conversations see the patterns and discern which ideas were groundbreaking and significant and which are simply filling in the details? Or is curation and connection too labor-intensive and inefficient for the globalized marketplace of ideas?"

Which makes the next section interesting....

Library Top Trends - Personally tuned discovery layers 

Ken Varnum, at the recently concluded LITA Top Technology Trends session, certainly thinks that what is missing from current library discovery services is the ability for librarians to provide personally tuned discovery layers for local use.

He certainly thinks there is value in librarians slicing collections into customized streams of knowledge to suit local conditions. You can jump to his section on this trend here. Roger Schonfeld's section on anticipatory discovery for current awareness of new publications is interesting as well.




To Barbara Fister's question on whether curation is too labour-intensive or inefficient, Ken would probably answer no, and he suggests that in the future librarians can customize collections based on subject as well as appropriateness of use (e.g. an undergraduate vs. a scholar).

It sounds like a great idea, since Summon and EBSCO discovery layers currently provide only hardcoded discipline sets, and I can imagine eventually being able to create subject sets based on collections at the database and/or journal-title level (shades of the old federated search days, or of librarians creating Google Custom Search engines, e.g. one covering NGO sites, or Jurn (open access in the humanities)).

At the even more granular level, I suppose one could also pull from reading lists etc.

Unlike Ken, though, I am not 100% convinced it would take just "a little bit of work" to make this worthwhile, or at least better than the hardcoded discipline sets.


NISO Publishes Recommended Practice on Promoting Transparency in Library Discovery Services


NISO RP-19-2014, Open Discovery Initiative: Promoting Transparency in Discovery [PDF] was just published.

Somewhat related is the older NFAIS Recommended practices on Discovery Services [PDF]

I've gone through it, as well as the "EBSCO supports recommendations of ODI" press release, and I am still digesting the implications, but clearly there is some disagreement about the handling of A&I resources (not that shocking).

Discovery Tools, a Bibliography

A highly recommended resource - this is a bibliography by François Renaville. Very comprehensive, covering papers from 2010 onwards.

It is a duplicate of the Mendeley group "Libraries & [Web-Scale] Discovery Tools".


Ebsco Discovery Layer related news

Ebsco has launched a blog "Discovery Pulse" with many interesting posts. Some tidbits

Note : I am just highlighting Ebsco items in this post because of their new blog as the blog may be of interest to readers. I would be happy to highlight Primo, Summon, WorldCat discovery service items when and if I become aware of them. 


Summon Integrates Flow research management tool.

It was announced that in July Summon will integrate with ProQuest Flow, their new cloud-based reference management tool.


The word Login is extremely misleading in my opinion. 

I have very little information about this and how overt the integration will be. But given that Mendeley was acquired by Elsevier, and Papers by Springer, it's no wonder that ProQuest wants to get into the game as well.

It's all about trying to get into the researcher's workflow, and as increasingly "discovery happens elsewhere", it would be smart to focus on reference management, an area the likes of Google currently seem to be ignoring (though moves like Scholar Library, where one can add citations found in Google Scholar to one's personal library, may suggest otherwise).

Mendeley certainly has shown that reference management is a very powerful place from which to build a digital foothold.

While it's still early days, Flow currently seems to have pretty much the standard features one sees in most modern reference managers, e.g. free storage up to 2 GB, support for the Citation Style Language (CSL), collaboration capabilities, etc. I don't see any distinguishing features or unique angles yet.

Here's a comparison in terms of storage space for the major competitors such as Mendeley.

The webinar I attended on it (sorry, I don't have a link to the recording) suggests ProQuest has big plans for Flow beyond reference management. It will aim to support the whole research cycle, and I think this includes support as a staging ground for publication (submission to PQDT??), as well as support for pre-publication works (posting to institutional or subject repositories?).

It will be interesting to see if ProQuest will try to leverage its other assets, such as Summon, to support Flow. E.g. would ProQuest tie recommender services drawn from Summon usage into it?

Currently you can turn off Flow in Summon without much ill effect, and it seems some libraries have done so because it takes time to evaluate the tool and prepare staff to support it, but it remains to be seen whether, in the long run, Flow might have too many features and too much value to be turned off.

BTW, if you want to keep up with articles, blog posts, videos etc. on web-scale discovery, do consider subscribing to my custom magazine on Flipboard (currently over 1,200 readers) or looking at the bibliography on web-scale discovery services.




This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.