Monday, November 3, 2014

A script to fetch a bibliography list from inspirehep by parsing the LaTeX file that cites the references

When writing short scientific papers, much of the work goes into ordering the reference list. A professional article requires that the references appear in the bibliography in the order in which they are cited in the main text (and not alphabetically, which is what most bst files do). Doing this by hand is laborious and has to be repeated every time a reference is added, so it is worth writing a script to automate the task. The script below looks for the citation tags in the LaTeX file, fetches the corresponding references straight from the server at inspirehep.net, and prepares a reference file for the user to copy. We no longer need to copy and paste the references manually, and the script also arranges them in the order of their appearance.

An example of a citation appearing in the main text:

... as it was shown in \cite{Chatterjee:2013daa} that merging black holes may not violate the second law of thermodynamics .....

The script looks for the "\cite" tag and extracts all the citation keys into a single file, in the order in which they appear in the main text. It then removes duplicate keys, keeping the first occurrence of each. For every key in that list it fetches the bibliography entry from inspirehep.net and concatenates the entries into a file called "bibtexfile". The user then copies the contents of "bibtexfile" to the end of the LaTeX file and compiles.
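
The extraction and de-duplication steps can be tried in isolation before running the full script. The following is a minimal sketch, not the script itself: the function name extract_cite_keys and the second key in the sample line (Key:2014abc) are hypothetical, and it assumes a grep that supports -o and citations written as plain \cite{...}, possibly with several comma-separated keys.

```shell
#!/bin/bash
# Hypothetical helper (not part of getLatex.sh): print the citation keys
# found in LaTeX source on stdin, in order of first appearance, without
# duplicates.
extract_cite_keys() {
    grep -o '\\cite{[^}]*}' |   # keep only the \cite{...} fragments
    sed 's/\\cite{//; s/}//' |  # strip the wrapper, leaving "key1,key2"
    tr ',' '\n' |               # one key per line
    tr -d ' \t' |               # drop stray whitespace
    awk 'NF && !seen[$0]++'     # keep the first occurrence of each key
}

# Sample input: the second key is made up for illustration.
printf '%s\n' \
    'As shown in \cite{Chatterjee:2013daa}, and later in \cite{Key:2014abc, Chatterjee:2013daa}.' |
    extract_cite_keys
```

For the sample line this prints Chatterjee:2013daa followed by Key:2014abc, each once, even though the first key is cited twice.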

For this script to work, certain command line tools need to be installed: Perl (installed by default), lynx, and curl. On Debian-based systems we can install the missing ones with

sudo apt-get install lynx curl

Save the following script, make it executable, and pass the LaTeX file to it as an argument.

  1. Save as getLatex.sh
  2. chmod 755 getLatex.sh
  3. ./getLatex.sh mynewpaper.tex


The output will be saved to the file "bibtexfile".

Copy the contents of "bibtexfile" to the end of your LaTeX file and compile.
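
If the paper does not yet have a bibliography section, the copied entries need to sit inside a thebibliography environment. Here is a minimal sketch of that step, assuming that bibtexfile holds \bibitem entries (which is what the LaTeX(US) export is expected to contain); the function name wrap_bibliography and the output file name references.tex are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: wrap the fetched entries in a thebibliography
# environment so the result can be pasted at the end of the paper.
# Assumes the input file contains \bibitem entries.
wrap_bibliography() {
    local entries="$1"
    printf '%s\n' '\begin{thebibliography}{99}'
    cat "$entries"
    printf '%s\n' '\end{thebibliography}'
}

# Only run when the output of getLatex.sh is present.
if [ -f bibtexfile ]; then
    wrap_bibliography bibtexfile > references.tex
fi
```

The argument {99} only tells LaTeX how wide the widest label is; any two-digit number works for a list of fewer than 100 references.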

==================================================================

#!/bin/bash
# This script fetches the LaTeX bibliography entries for the references cited in a LaTeX file and puts them in the order in which they appear in the original document, so the references do not have to be sorted manually.

# getLatex.sh v1.1 Jones
# Usage : getLatex.sh <filename>

echo "This script fetches the LaTeX bibliography entries for the references cited in a LaTeX file and puts them in the order in which they appear in the original document, so the references do not have to be sorted manually."

# check if lynx is installed ("command -v" prints nothing and fails if the tool is missing)
if ! command -v lynx > /dev/null 2>&1 ; then
   echo -e "  <lynx> not installed. Install <lynx> for this script to work.\n  sudo apt-get install lynx\n  Exiting ..."
   exit 1
fi
# check if curl is installed
if ! command -v curl > /dev/null 2>&1 ; then
   echo -e "  <curl> not installed. Install <curl> for this script to work.\n  sudo apt-get install curl\n  Exiting ..."
   exit 1
fi

# check if the user provided a file name, otherwise exit
if [ -z "$1" ] ; then
   echo -e "  Did not supply a file name. \n  Usage: getLatex.sh <filename>.\n  Exiting ..."
   exit 1
fi

# check if the file exists (quoted so that file names with spaces work)
if [ ! -f "$1" ]; then
    echo -e "  File not found.\n  Exiting ..."
    exit 1
fi

# begin by cleaning temporary files (-f ignores files that do not exist)
rm -f .temp .temp2 .temp3 .temp4 .temp5 .temp6

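# extract the citation keys in order of first appearance: split the lines at "}" and before "cite",
# keep the "cite{..." pieces, strip the wrapper, split comma-separated keys, remove spaces and drop duplicates
grep cite "$1" | sed 's/}/\n/g' | sed 's/cite/\ncite/g' | grep cite | sed 's/cite{//g' | sed 's/,/\n/g' | sed 's/ //g'  | perl -ne 'if (!defined $x{$_}) { print $_; $x{$_} = 1; }' > .temp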


#cat $1 | grep ":" | grep -v "ARXIV"  | grep -v "=" | sed 's/@article{//' | sed 's/,//' | sed 's/}//' | sed 's/\.//' | sed "s/:/%3A/g" | sort -u > .temp


# check that the extraction worked. If .temp is empty, the citations in the file are in the wrong format or the file contains no citations.
if [ ! -s .temp ] ; then
  echo -e "  Enter the references in correct format in the file: $1. They should have the tag \\\cite.\n  Exiting ..."
  exit 1
fi

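# start with an empty output file so that it exists even if nothing is fetched
: > .temp6

for i in `cat .temp`
do
    echo "Getting --[ $i ]--"
    # search inspirehep.net for the citation key
    curl -# "http://inspirehep.net/search?ln=en&ln=en&p=$i&of=hb&action_search=Search&sf=&so=d&rm=&rg=25&sc=0" > .temp2
    # extract the path of the record's "LaTeX(US)" export link from the result page
    link2=`grep "LaTeX(US)" .temp2 | grep record | cut -d '"' -f2 | sed "s|http://inspirehep.net||"`
    # skip keys for which no record was found instead of dumping the home page
    if [ -z "$link2" ]; then
        echo "  No LaTeX(US) record found for $i. Skipping ..."
        continue
    fi

    # dump the export page as text and keep the lines from the cite comment up to the page footer
    lynx -dump "http://inspirehep.net$link2" > .temp4
    sed -n '/cite/,/HEP :: /p' .temp4 > .temp5
    grep -v "HEP" .temp5 >> .temp6
    echo "==========="
done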
mv .temp6 bibtexfile

# end by cleaning temporary files
rm -f .temp .temp2 .temp3 .temp4 .temp5

echo "  File written to bibtexfile."
=======================================================================

Note: This script fetches only LaTeX bibliography entries. To fetch BibTeX entries we need another script, which I will publish next. However, using BibTeX defeats the purpose of sorting the references in the order in which they appear in the text, because the references are sorted automatically by the BibTeX bst file. Most natbib bst files sort the references alphabetically. A notable exception is h-physrev.bst from arXiv.org, which sorts the references in the order in which they appear in the document. So if we are using BibTeX and want our references sorted in the order of their appearance in the main text, we should use the h-physrev.bst file. For small articles, though, I prefer plain LaTeX references to BibTeX for simplicity, and this is where this script comes in handy. A script for fetching BibTeX entries is nevertheless advantageous in its own right: while writing a thesis, we do not need to worry about manually copying the BibTeX-formatted references. We only have to pass our previously written papers to that script, which will fetch the BibTeX entries for the master.bib file.
