acm-bib-napper

the story behind the bib-napper tools

acm-bib-napper is a tool i created to automatically extract complete bibtex repositories from the acm digital library. each repository includes the official bibtex for every paper, with the abstract added as a field in the bibtex entry.

this is super useful for a number of reasons. for example, if you currently wanted a single robust bibtex file for all of the isca papers published since 2000, you would have to manually click on each paper and collect its bibtex entry one at a time. if you also wanted the “abstract =” field of the bibtex, you would have to manually collect that for each paper as well and add it to its bibtex entry. this would be extremely tedious, practically intractable. so i spent an afternoon reverse engineering how papers are organized in the acm digital library and wrote a small python script that does this work for you.

this is useful for any researcher who uses programs such as bibdesk, mendeley, mekentosj’s “papers” application, and especially bibix. with bib-napper you can have a local searchable repository of all of the papers published at venues of interest, with the official acm bibtex information and the associated abstracts.

disclaimer: please do not abuse this tool. it’s relatively easy to modify it to automatically download all of the papers associated with each bib, but this is not the intended purpose, as it would generate a ton of traffic to the acm digital library. we love the acm digital library, so let’s be nice to it. in addition, many of these bibtex archives have already been generated for you at the acm bib project (on this site), so you can simply snag the repos that are there.

download the tool (python script)

download the acm-bib-napper – naps bibtex from acm digital library with added abstracts

how to use / how it works

first, make sure that you’ve selected “single page view mode” for the acm digital library. to do this, go to any paper’s page and click switch to single page view. the setting is cached in your browser.

awesome! ok, now we need to select the list of journals/proceedings that we want to generate the bibtex repo from. you can use an automatic downloader to grab the pages, or right click each one and select save as into a folder. i’ve used a firefox plugin called “downthemall” that lets you select a region and download all of the files linked in that region. (don’t worry, read on, you’ll see how this all works by the end)

i’ve used firefox’s downthemall plugin to download all of the journals/proceedings i care about; you can also do this manually

now you should have all of the html files in a single directory (make sure they are *.html files).
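if you want to double check the directory before running anything, here is a minimal sketch (not part of bib-napper) that splits the files in a folder into *.html files and everything else. the function names `split_by_extension` and `check_directory` are made up for this example, and the case-insensitive match on the extension is my assumption; the script itself may be stricter.

```python
import os


def split_by_extension(names):
    """Split file names into (html, other) based on a .html extension."""
    html, other = [], []
    for name in sorted(names):
        # Assumption: treat .HTML and .html the same way.
        if name.lower().endswith(".html"):
            html.append(name)
        else:
            other.append(name)
    return html, other


def check_directory(directory="."):
    """Apply split_by_extension to the regular files in `directory`."""
    names = [
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    ]
    return split_by_extension(names)
```

calling `check_directory()` in the download folder returns the two lists. anything in the second list (other than the script itself, once you copy it in) is a page that was probably saved without the .html extension and may need renaming.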

this is where my awesome script comes in. using a terminal app, put my script in this directory and simply run it. it should look like this.

it will automatically detect all of the papers for each journal/proceeding, snag the official bibtex, add the corresponding abstracts, and write all of the bibs into a file called master.bib. if something goes wrong it will say “Bib-napping fail!” this should not happen though. when bib-napping is complete you will see the following. update! – it also prints to the screen each bib it’s saving to master.bib.
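to give a feel for the "official bibtex plus abstract" step, here is a rough sketch of how an abstract could be spliced into a single bibtex entry. this is not the code from bib-napper; the helper name `add_abstract` and the sample entry are made up for illustration, and it assumes the entry is a plain string ending in its closing brace.

```python
def add_abstract(bibtex_entry, abstract):
    """Insert an `abstract = {...}` field before the closing brace of one entry."""
    entry = bibtex_entry.rstrip()
    if not entry.endswith("}"):
        raise ValueError("expected a BibTeX entry ending in '}'")

    # Drop the final brace and make sure the last field ends with a comma.
    body = entry[:-1].rstrip()
    if not body.endswith(","):
        body += ","

    # Collapse newlines and repeated spaces so the abstract stays on one line.
    clean = " ".join(abstract.split())
    return f"{body}\n abstract = {{{clean}}},\n}}"


# Hypothetical entry, only for demonstration.
sample = "@inproceedings{doe2000,\n title = {A Paper},\n year = {2000}\n}"
merged = add_abstract(sample, "First line.\nSecond line.")
```

note that this sketch does not escape anything, so an abstract containing unbalanced braces would produce a broken entry.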

the html files are automatically removed by bib-napper. ids.clump is essentially a dump of all the html files. if something fails you can call bib-napper again and it will try again using the ids.clump file. the master.bib file contains the complete repository of bibs with abstracts.

TADA! as you can see, you have full bib entries with abstracts. note that the abstracts may have html tags in them here or there. they are benign though. when you import these bibs into mendeley the tags will show up, but there are usually very few of them and they don’t interfere with the searchability of the repository. if anyone wants to update the script to remove these tags, feel free, and shoot me an email.
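for anyone who wants to take a crack at removing the tags, here is one possible starting point using python's standard `html.parser` module. it is only a sketch and is not wired into bib-napper; the names `_TextOnly` and `strip_html_tags` are made up for this example, and it assumes you call it on the abstract text before it gets written into the bib entry.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html_tags(text):
    """Return `text` with html tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()
    return " ".join("".join(parser.parts).split())


cleaned = strip_html_tags("<p>We propose a <i>fast</i> cache.</p>")
```

one caveat: text from adjacent tags is joined without a space, so two back-to-back paragraphs would run together. if the abstracts contain several `<p>` blocks, you would want to handle that case separately.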

if you like it, or it doesn’t work, or something, shoot me an email.