Sunday, March 20, 2016

Ezpaarse - an easier way to analyze ezproxy logs?

I've recently been trying to analyse ezproxy logs for various reasons (e.g. supplementing vendor usage reports, cost allocation, studying library impact, etc.), and those of you who have done so before will know it can be a pretty tricky task given the size of the files involved.

In fact, there is nothing really special or difficult about ezproxy logs other than their size; a typical log line will look something like

140.42.65.102 jRIuNWHATOzYTCI p9234212-1503252-1-0 [17/May/2011:10:01:44 +1000] "GET http://heinonline.org:80/HOL/ajaxcalls/get-section-id?base=js&handle=hein.journals/josf65&id=483 HTTP/1.1" 200 120

Your library's logs may show slightly more detail, such as user login information, the user-agent (i.e. the type of browser) and the referrer (i.e. the URL the user was on before).

In fact, you could even import this into Excel using space as a delimiter to get perfectly analyzable data. The main issue is that you can't go very far this way, because Excel is limited to about one million rows (1,048,576, to be exact).

Overcoming the size issue

So if one can't use Excel, what about exporting the data into a SQL database?

One idea is to use sed - a stream editor - to convert the files into CSV and import them into a SQL database, which is capable of managing a large number of records, though you may still come up against the memory limits of your machine.
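As a rough sketch of that conversion step (using awk, a sibling of sed, since it handles fields more naturally; the file names are placeholders and the field positions assume the sample line above):

# emit ip, session token, user, timestamp and URL as five comma-separated columns
# (URLs containing commas would still need proper escaping for a strict CSV import)
awk '{ print $1 "," $2 "," $3 "," $4 " " $5 "," $7 }' ezproxy.log > ezproxy.csv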

In any case, I personally highly recommend sed; it is capable of finding, replacing and extracting from even very large text files in an efficient manner, because it is a stream editor. For example, I can use it to go over 15 GB of ezproxy logs and extract the lines that contain a certain string (e.g. sciencedirect.com) in less than 10 minutes on a laptop with 4-8 GB of RAM.
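That kind of extraction looks something like this (the file names are placeholders):

# print only the matching lines into a much smaller file
sed -n '/sciencedirect\.com/p' ezproxy.log > sciencedirect.log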

I messed around with it for a day or two and found it relatively easy to use.

What if you don't want to use a SQL database and just want to quickly generate the statistics?

Typically, most methods involve either

a) Working with some homebrew Perl or Python script - e.g. see the ones shared by other libraries here or here

b) Using some standard weblog analyzer like Sawmill, AWStats, analogx, etc.

These can run through your logs and generate statistics on any reasonable machine.

Still too big? Another alternative is to do an analysis over so-called SPUs (starting point URLs), which basically capture only the very first time a user logs in via ezproxy and creates a session. This results in much smaller files; depending on the size of your library, you will probably be able to analyse them even in Excel.

You may have to set up your ezproxy configuration file to generate SPU logs, as they are not produced by default.
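For illustration, the directive in your ezproxy configuration would look something like the line below - this is an assumption on my part, so check OCLC's EZproxy documentation for the exact syntax your version supports:

LogSPU -strftime spu%Y%m.log %h %l %u %t "%r" %s %b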

Session-based analysis

But regardless of the method I studied, I realized that fundamentally they all gave the same kind of results: what I call session-based analysis.

Example output from this script

These methods would tell you how many sessions were generated and, combined with the domains in the HTTP requests, could tell you the number of sessions or users for each domain (say Scopus, or JSTOR).
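For the curious, a crude command-line version of this count might look like the lines below - the field positions (session token in field 2, URL in field 7) are assumptions based on the sample line earlier, so adjust them to your own LogFormat:

# list unique (domain, session) pairs, then count sessions per domain
awk '{ split($7, u, "/"); print u[3], $2 }' ezproxy.log | sort -u | cut -d' ' -f1 | sort | uniq -c | sort -rn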

But sometimes sessions alone are not enough; if you want more in-depth analysis, like the number of PDFs downloaded or pages viewed from, say, Ebsco or Sciencedirect, you are stuck.

The difficulty lies in the fact that it isn't always obvious from the HTTP request whether the user is requesting a PDF download, or even whether it is an HTML view on that platform.

Certainly, if you wanted to, you could do a quick ad hoc analysis of the URLs for one or two platforms, but to do it for every platform you subscribe to (and most libraries subscribe to hundreds) would be a huge task, especially if you started from scratch.
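Such an ad hoc attempt might be as crude as the lines below - note that matching on "pdf" in the URL is pure guesswork on my part, and every platform encodes downloads differently, which is exactly why scaling this up is so painful:

# a very rough proxy for PDF downloads on one platform: lines whose URL merely mentions "pdf"
grep 'sciencedirect\.com' ezproxy.log | grep -ic 'pdf'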

Is there a better way?


Going beyond session-based analysis with ezpaarse

What if I told you there is a free, open source tool - ezpaarse - that already has URL patterns for parsing over 60 commonly subscribed library resources and can produce data-rich reports like the ones below?

[Screenshots: sample ezpaarse reports]

Starting out with Ezpaarse

Ezpaarse comes in two versions: a local version you can host and run on your own servers and, more interestingly, a cloud-based version.

The cloud-based version is perfectly serviceable and great to use if you don't have the resources or permission to run your own servers, but obviously one must weigh the risk of sending user data over the internet, even if you trust the people behind ezpaarse. (As far as I can tell, the ezproxy log you upload to the cloud version is not sent over a secured connection.)

One can reduce the risks by anonymizing IP addresses, masking emails, cleaning HTTP requests, etc. before sending the logs off to the cloud, of course (I personally recommend using sed to clean the logs).
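A minimal anonymizing pass might look like this, assuming the IP address is the first field on each line, as in the sample line earlier:

# blank out the leading IP address on every line before uploading
sed -E 's/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/0.0.0.0/' ezproxy.log > cleaned.log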


Choosing the right log format 

Your logs might be in a slightly different format, so the first step after you sign in is to specify the format of your logs. You do so by clicking on the "Design my log format" tab, then throwing in a few lines of your logs to test.

If you are lucky, it may automatically recognise your log format; if not, you need to specify the log format yourself.

Typically, you need to look in your ezproxy configuration file for the log directive. Look for something like

LogFormat %h %l %u %t "%r" %s %b

If you did it correctly, it should interpret the sample lines nicely, like this (scroll down). For reference, %h is the client IP address, %u the username, %t the timestamp, "%r" the full HTTP request, %s the HTTP status code and %b the size in bytes.

[Screenshot: the sample lines correctly interpreted]

If you are having problems getting this to work, do let the people at Ezpaarse know; they will help you figure out the right syntax. My experience so far is that they are very helpful.

In fact, for ease of reuse, the ezpaarse people have already helped some institutions create preset parameter sets. Click on "parameters".

[Screenshot: the parameters screen]

You can see some predefined parameters for various institutions. They are mostly from France and Europe in the screenshot, but as you scroll down you will see that libraries from the US and Australia are already included, showing that word of this tool is spreading.

You can look at other options, including the ability to email you when the process is complete, but most intriguing to me is the ability to simulate COUNTER reports (JR1).

I haven't tried it yet, but it could be used as a sanity check against vendor reports (differences are expected, of course, because ezproxy logs typically cover only off-campus access, etc.).


Loading the file and analyzing

Once done, the rest is simple. Just click on the "Logfiles" tab and add the files you want to upload.

I haven't tried it with huge files (e.g. >4 GB), so there may be file limits, but it does seem to work for reasonably sized files, as it seems to be reading them line by line.
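If you do run into a limit, one workaround (untested on my part) would be to split the log into chunks first and upload them separately:

# split into files of one million lines each (chunk_aa, chunk_ab, ...)
split -l 1000000 ezproxy.log chunk_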



As the file is processed line by line, you can see the number of platforms recognized and the accesses recorded so far. My own personal experience was that it occasionally choked on the first line of the logs and refused to work, so it might be worthwhile clicking on "system traces" to see what error messages occur.

Downloading and reporting


Once the file is 100% processed, you can simply download the result.

It is a simple CSV file where the data is delimited by semicolons, so you can open it with many tools, such as Excel.
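Because it is plain CSV, the usual command-line tools work on it too. For instance, a quick tally of one column - the column position is an assumption here, so check the header row of your own file:

# skip the header row, then count rows per value of the first semicolon-delimited column
tail -n +2 result.csv | awk -F';' '{print $1}' | sort | uniq -c | sort -rn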



You can see the processed file below.

[Screenshot: the processed file]

There is a ton of information that ezpaarse manages to extract from the ezproxy log, including but not limited to:

a) Platform
b) Resource type (Article, Book, Abstract, TOC etc)
c) File type (PDF, HTML, Misc)
d) Various identifiers - ISSN, DOIs, Subjects (extracted from DOIs) etc.
e) Geocoding - By country etc

It's not compulsory, but you can also download the Excel template and load the processed file through it to generate many beautiful charts.

[Screenshots: charts generated from the Excel template]

Some disadvantages of using Ezpaarse

When I got the cloud-based version of Ezpaarse to work, I was amazed at how easy it was to create detailed, information-rich reports from my ezproxy logs.

Ezpaarse was capable of extracting very detailed information that I wouldn't have thought possible. This is due to the very capable parsers built in for each platform.

This is also its weakness, because ezpaarse will totally ignore lines in your logs from platforms for which it has no parser.

You can see the current list of parsers available and the ones currently being worked on.

While over 60 platforms have parsers, such as Wiley, T&F, Sciencedirect, Ebscohost, etc., many popular ones such as Factiva, Ebrary, Westlaw, Lexisnexis, Marketline and Euromonitor are still not available, though they are in progress.

Of course, if you subscribe to obscure or local resources, the chances of them being covered are nil unless you contribute a parser yourself.

Overall, it seems to me that Ezpaarse currently has more parsers for the traditional large journal publishers and fewer for business- and law-type databases, so institutions specializing in law or business may get less benefit from Ezpaarse.

In some ways, many of the parsers cover platforms that libraries typically get COUNTER statistics from anyway, but ezproxy log analysis goes beyond simplistic COUNTER statistics, allowing you, for example, to consider other factors like user group or discipline, as such data is available in your ezproxy logs.

A lot of the documentation is also in French, but nothing Google Translate can't handle.

Conclusion

Ezpaarse is a really interesting piece of software. The fact that it is open source and allows libraries to contribute parsers for each platform without everyone reinventing the wheel is a potential game changer.

What do you think? I am a newbie at this ezproxy analysis business with limited skills, so do let me know what I have missed or misstated. Are there alternative ways to do this?

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.