Sunday, July 11, 2010

Extracting metadata from pdfs - comparing EndNote, Mendeley, Zotero & WizFolio

Note: This was blogged in July 2010. Since then most of the reference managers have improved substantially, so the information here can be considered outdated.

Interest in reference managers is growing as competition in the area increases. Martin Fenner is well known for, among many other things, his reference manager overview, while Owen Stephens over in the UK has so far organized two conferences with the title "Innovations in Reference Management".

The latest reference managers are not simple citation/reference managers. They try to take web2.0 trends into account, allowing sharing of references, recommending articles and fitting into workflows (working with institutional repositories etc). Still, I think I'm most interested in the most basic function of a reference manager.

Namely, how easily does it allow you to import references?

There are many methods that can be used to add references to reference managers, but recently EndNote X4 added the ability to add bibliographic data from the pdfs you download.

This feature, which was already available in Mendeley, Zotero and WizFolio, is very useful for users who have a bunch of pdf articles but have been creating references manually in the past. To start using the reference manager, they just point it to the folders with the pdfs and automatically get the correct bibliographic data with the pdfs linked to it.

It's not as easy as that, of course. There are various methods to figure out the bibliographic data from the pdf article: extracting metadata (including the doi) that publishers have embedded into the pdf, cross-checking with other databases such as Google Scholar (Zotero) or PubMed (WizFolio), or possibly some sort of crowdsourcing of the correct info (Mendeley). If all else fails, the reference manager can try to "guess" from the text based on where it appears on the page.
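To make the doi route a little more concrete, here is a minimal sketch in Python of spotting a doi in text that has already been pulled out of a pdf. It is not how any of the four tools actually work internally (I don't know that). The pattern is a loose version of the commonly used "10.prefix/suffix" shape, and the sample text and the name find_doi are made up for illustration.

```python
import re

# Loose pattern for modern DOIs: "10.", a 4-9 digit registrant prefix, "/",
# then a suffix made of commonly allowed characters. This is an assumption
# for illustration, not an official or exhaustive DOI grammar.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string found in text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    # (A DOI that really ends in ")" would be cut short by this step.)
    return match.group(0).rstrip(".,;)")


# Hypothetical first-page text, as a pdf text extractor might return it.
sample_page = (
    "Journal of Examples 12(3), 45-67. "
    "doi:10.1234/example.2010.001. Received 1 May 2010"
)
print(find_doi(sample_page))
```

Once a doi is found, the rest of the citation can be looked up elsewhere. When no doi is present, a tool is left with the fuzzier options above, such as searching by title or guessing from the page layout.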

So how good are they at figuring out citations from pdfs?

I ran two simple non-scientific tests. First, I went to Scopus, searched for the term wikipedia and ranked the results by relevance. I then located the full-text via the "Find at publisher" button and downloaded the pdfs of the top 10 results into one folder.

Here are the steps for each reference manager:

EndNote X4 : Select File->Import->Folder

Mendeley Desktop : Add documents -> Add folder. [edit] I was not logged into my Mendeley account when I did the test. The results might be a lot better when logged in.

Note: I did not use the optional function to search Google Scholar to pull in results.

Zotero 2.0.3 : Link to file -> Retrieve metadata for PDF

WizFolio (version as of 11 July 2010) : Add -> Upload file. Note: I did not use the optional "locate bibliography" function, but as it searches PubMed, I doubt it would make much difference to the results.

Once the citations were obtained in the reference manager, I checked them manually either against the "official citation" from the publisher or by eyeballing the actual pdf.

1. the Wikipedia for Reputation (ScienceDirect)
2. How and why do college students use Wikipedia? (Wiley)
3. characteristics of wikipedia members (
4. Toward an epistemology of Wikipedia (Wiley)
5. concept thesaurus for better web media understanding (ACM DL)
6. news topic threads with wikipedia entries (IEEE)
7. wikipedia articles with semantic tags for structured retrieval (ACM DL)
8. Is Wikipedia link structure different? (ACM DL)
9. importance of link evidence in Wikipedia (Springer book chapter)
10. the wikipedia phenomenon: A case for agent based modeling (ACM DL)

Total
EndNote: 4 Pass, 1 Partial, 5 Fail
Mendeley: 3 Pass, 7 Partial
Zotero: 8 Pass, 2 Partial
WizFolio: 4 Pass, 1 Partial, 5 Fail

PASS = 5 main fields (see below) are correct
FAIL = No info in the 5 fields extracted or all wrong info
PARTIAL = At least 1 of the 5 fields is correct (but not all 5)

The 5 main fields I checked were

1. Article title
2. Author
3. Publication Year
4. Journal or Conference & vol/issue
5. Page number
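
The PASS / FAIL / PARTIAL rule can be written down as a few lines of Python. This is only a sketch of the rubric: I did the checking by eye, whereas the code uses exact string comparison, and the field names and the two sample citations are made up for illustration.

```python
# The 5 main fields, under hypothetical key names.
FIELDS = ("title", "author", "year", "source", "pages")


def grade(extracted, official):
    """Grade one extracted citation against the official one."""
    correct = sum(
        1
        for field in FIELDS
        # A missing or empty field never counts as correct.
        if extracted.get(field) and extracted.get(field) == official.get(field)
    )
    if correct == len(FIELDS):
        return "PASS"
    if correct == 0:
        return "FAIL"
    return "PARTIAL"


# Hypothetical official citation and a hypothetical extraction of it.
official = {
    "title": "An example article",
    "author": "A. Author",
    "year": "2010",
    "source": "Journal of Examples 12(3)",
    "pages": "45-67",
}
title_only = {"title": "An example article", "year": "2009"}

print(grade(official, official))
print(grade(title_only, official))
print(grade({}, official))
```

A citation with only the title right and one with four fields right both come out as PARTIAL, which is exactly the coarseness discussed next.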

Initially, "Partial" was further qualified by which fields were missing, present/correct, present/wrong or present/incomplete, but it got complicated fast! Since I'm not writing a journal article, I'm going to keep things simple.

For our purposes here, "partial" means at least 1 field correct. 

Of course, this means "partial" could mean very different things.

There are "partials" where almost every field is correct except for either a minor error in the title or, more often, one missing field (typically the Source or Conference field).

And there are "partials" where only the article title is correct, the rest could be missing or even wrong. 

Oh well, as I said, this is a totally unscientific test.

Comments on results

The video on new features in EndNote states that this feature works only on official vendor pdfs and pulls info via crossref/doi. In general EndNote plays it safe: unlike, say, Mendeley or WizFolio, when it produces a citation it is almost always 100% correct. There's one exception, where it doesn't seem to handle a conference proceeding properly, earning a "partial" because it thinks the citation is a journal article and hence lacks the information for the conference.
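
For the curious, here is a rough Python sketch of what a crossref/doi lookup can look like. It assumes the public Crossref REST API (https://api.crossref.org/works/ followed by the doi) and its JSON reply format. I have no idea whether EndNote does it this way internally. The function names and the sample record are made up, and the network call is defined but not run here.

```python
import json
import urllib.parse
import urllib.request


def crossref_url(doi):
    """Build the lookup URL for one DOI (public Crossref REST API)."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi)


def fetch_crossref(doi):
    """Fetch the record for a DOI; needs network access, so not called below."""
    with urllib.request.urlopen(crossref_url(doi), timeout=10) as response:
        return json.load(response)["message"]


def to_five_fields(message):
    """Reduce a Crossref-style record to the 5 main fields used in this test."""
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    return {
        "title": (message.get("title") or [None])[0],
        "author": authors,
        "year": date_parts[0][0] if date_parts and date_parts[0] else None,
        "source": (message.get("container-title") or [None])[0],
        "pages": message.get("page"),
    }


# Hypothetical record shaped like the "message" part of a Crossref reply.
sample_message = {
    "type": "journal-article",
    "title": ["An example article"],
    "author": [{"given": "Ann", "family": "Author"}, {"family": "Collective"}],
    "issued": {"date-parts": [[2010, 7]]},
    "container-title": ["Journal of Examples"],
    "page": "45-67",
}
print(crossref_url("10.1234/example.2010.001"))
print(to_five_fields(sample_message))
```

The record also carries a "type" value (journal article, proceedings article and so on), so a tool that ignores it could file a conference paper as a journal article, much like the one "partial" above.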


[edit] I was not logged into my Mendeley account when I did the test. A quick retest shows that when logged in, Mendeley seems to be automatically (?) pulling in data from other sources (crossref?), and results are then on par with Zotero in this test. Will investigate. What follows below is based on not being logged in.
I'm not sure how Mendeley handles this, but my test of the 10 papers above shows that it is almost always able to get something, though it's not 100% accurate, resulting in many "partials". These range from cases where it gets almost everything correct except for a wrong or missing journal/conference field, to cases that are almost totally incorrect, with many missing or wrong fields and only the article title being correct.

In this simple test, Zotero seems to do the best. Like EndNote, it seems to extract the metadata via doi (there's one anomaly: the 4th title, where it gets it almost right except that the title is messed up with crossref xml tags), but it outdoes EndNote by producing almost perfect bibliographic data for an additional 5 entries.

The trick seems to be that Zotero pulls records from Google Scholar. Unlike EndNote, Zotero also seems to be able to recognize article types other than journal articles, correctly using conference paper types with the correct field info.

I get the impression WizFolio is using the exact same technique as EndNote, except that for pdfs without dois it tries to "guess" like Mendeley. It doesn't do so well: at best it gets the title right, but it almost always gives the wrong year of publication, issue or paging when it tries.

In general, EndNote seems to give the baseline results. Zotero does the best because of Google Scholar integration. Mendeley and WizFolio give mixed results; of the two Mendeley seems better, but both sometimes give wrong information. In terms of publishers, Wiley and ScienceDirect articles seem to be the easiest to handle, probably because metadata/doi is embedded in the pdf?

These are the "official" pdfs from the publisher, so in theory, figuring out the citations should be easy. The next simple test I did was a search in Google Scholar. The idea here was to see how well the reference managers did with conference proceedings, "unofficial" preprints etc.

1. Wikipedia XML corpus (ACM DL)
2. semantic relatedness using wikipedia-based explicit semantic analysis
3. Wikipedia (ACM DL)
4. Computing Semantic Relatedness Using Wikipedia (
5. Wikipedia ( , preprint)

Total
EndNote: 5 Fail
Mendeley: 1 Pass, 4 Partial
Zotero: 5 Pass
WizFolio: 2 Partial, 3 Fail

As you can see, not very well. EndNote totally fails, while WizFolio and Mendeley barely fare better. Zotero is the star here, but again, given the Google Scholar link, this is not surprising.

When logged into my Mendeley account, results are not very different, with 1 more PASS for the 3rd item.
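
Adding up the totals from both tests gives a rough combined picture over all 15 pdfs. The Python sketch below does only that arithmetic. The counts are the totals reported above (Mendeley while not logged in), with the 5 unlabelled results for EndNote and WizFolio in the first test taken to be fails, since that is the only category left.

```python
from collections import Counter

# Test 1: 10 publisher pdfs found via Scopus.
# The FAIL counts for EndNote and WizFolio here are inferred (10 - pass - partial).
TEST_1 = {
    "EndNote": Counter(PASS=4, PARTIAL=1, FAIL=5),
    "Mendeley": Counter(PASS=3, PARTIAL=7),
    "Zotero": Counter(PASS=8, PARTIAL=2),
    "WizFolio": Counter(PASS=4, PARTIAL=1, FAIL=5),
}

# Test 2: 5 documents found via Google Scholar.
TEST_2 = {
    "EndNote": Counter(FAIL=5),
    "Mendeley": Counter(PASS=1, PARTIAL=4),
    "Zotero": Counter(PASS=5),
    "WizFolio": Counter(PARTIAL=2, FAIL=3),
}


def combined(manager):
    """Add up the grade counts of one reference manager over both tests."""
    return TEST_1[manager] + TEST_2[manager]


for manager in TEST_1:
    total = combined(manager)
    print(
        f"{manager:9s} {total['PASS']:2d} pass, "
        f"{total['PARTIAL']:2d} partial, {total['FAIL']:2d} fail"
    )
```

With samples this small the combined numbers should not be read as anything more than a summary of the two tables.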


Let me stress again, this is a test I did out of curiosity. A sample of 15 articles clearly isn't enough to provide anything but my subjective impressions. For one thing, it mostly covers articles from ScienceDirect, Wiley and ACM Digital Library. Results will differ, in particular if you get papers from, say, JSTOR.

Also, due to the search topic used, the articles pulled up are understandably all new (2005 and after). The results will probably differ a lot if the articles used were, say, from the 90s or even older, as the pdfs available would be different (no metadata embedded or, worse, just scanned pdfs).

I'm probably going to have to do another test with this, but the last time I tried, the hit rate was dismal for older articles even those in major databases like JSTOR. Another interesting avenue would be to look at stuff on preprint servers, open access servers etc.

Note: it goes without saying that if I were doing medical articles and used WizFolio's "locate bibliography" function to clean up citations, I'm sure the results would be excellent.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.