Project status: frozen, unmaintained

PyScholar

A "supervised" parser for Google Scholar, written in Python.

PyScholar is a command-line tool written in Python that implements a querier and parser for Google Scholar's output. The project is inspired by scholar.py and reuses a lot of its code. The main difference is that scholar.py uses the urllib modules, so it cannot run JavaScript. Google does not like people scraping its search results, and when the server responds with the "I'm not a robot" page, scholar.py simply returns no output, often for a long time. PyScholar uses Selenium WebDriver instead, so you can see what is going on in the browser, and if the "I'm not a robot" page shows up you can pass the challenge manually and let the scraper continue its job.
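The manual-challenge idea can be sketched roughly as follows. This is a standalone illustration for Python 3, not the code used in pyscholar.py: the marker strings, `looks_like_challenge`, and `fetch_with_manual_challenge` are hypothetical names and guesses, and the only thing assumed about the driver is that it offers `get()` and `page_source`, as Selenium WebDriver objects do.

```python
# Assumption: these marker strings are illustrative guesses at what a
# robot-check page might contain; they are not taken from pyscholar.py.
CHALLENGE_MARKERS = ("i'm not a robot", "recaptcha", "unusual traffic")


def looks_like_challenge(page_source):
    """Return True if the HTML appears to be a robot-check page."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def fetch_with_manual_challenge(driver, url, prompt=input):
    """Load url and block until a human has cleared any challenge page.

    driver is any object with get() and a page_source attribute, for
    example a selenium.webdriver.Firefox instance.
    """
    driver.get(url)
    while looks_like_challenge(driver.page_source):
        # The browser window is visible, so the user can solve the
        # challenge there and then confirm in the terminal.
        prompt("Solve the challenge in the browser, then press Enter: ")
    return driver.page_source


# Possible usage with a real browser (requires the selenium package):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   html = fetch_with_manual_challenge(
#       driver, "https://scholar.google.com/scholar?q=quantum+theory")
#   driver.quit()
```

Passing the prompt function as a parameter keeps the waiting logic testable without a terminal or a browser.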

I also included some new features from my scholar.py fork: JSON export of the results, a "starting result" option, and the potential ability to retrieve an unlimited number of results, although results appear to be limited on the server side to approximately one thousand.
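To make the interplay between a starting result, paging, and the approximate server-side cap concrete, here is a minimal sketch. It is not taken from pyscholar.py: `page_offsets`, `to_json`, the page size of 10, and the exact cap of 1000 are assumptions chosen for illustration.

```python
import json

PAGE_SIZE = 10      # assumption: number of results per page
SERVER_CAP = 1000   # assumption: stands in for the "approximately one thousand" limit


def page_offsets(start, count, page_size=PAGE_SIZE, cap=SERVER_CAP):
    """Return the page offsets needed to cover `count` results from `start`.

    Offsets at or beyond `cap` are dropped, since the server is not
    expected to return anything for them.
    """
    end = min(start + count, cap)
    return list(range(start, end, page_size))


def to_json(articles):
    """Serialize a list of article dicts to a JSON string."""
    return json.dumps(articles, indent=2, sort_keys=True)


offsets = page_offsets(start=20, count=30)   # pages starting at 20, 30, 40
exported = to_json([{"Title": "On the quantum theory of radiation", "Year": "1917"}])
```

Asking for more results than the cap allows simply yields fewer pages, which matches the observation that requests past roughly one thousand results come back empty.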

I also removed Python 3 support, sorry.

I include here the content of the original scholar.py README.md, along with its changelog and license (replace "scholar.py" with "pyscholar.py" in the commands below to make them work):

scholar.py is a Python module that implements a querier and parser for Google Scholar's output. Its classes can be used independently, but it can also be invoked as a command-line tool.

The script used to live at http://icir.org/christian/scholar.html, and I've moved it here so I can more easily manage the various patches and suggestions I'm receiving for scholar.py. Thanks guys, for all your interest! If you'd like to get in touch, email me at christian@icir.org or ping me on Twitter.

Cheers,
Christian

Features

Note

I will always strive to add features that increase the power of this API, but I will never add features that intentionally try to work around the query limits imposed by Google Scholar. Please don't ask me to add such features.

Examples

Try scholar.py --help for all available options. Note, the command line arguments changed considerably in version 2.0! A few examples:

Retrieve one article written by Einstein on quantum theory:

$ scholar.py -c 1 --author "albert einstein" --phrase "quantum theory"
         Title On the quantum theory of radiation
           URL http://icole.mut-es.ac.ir/downloads/Sci_Sec/W1/Einstein%201917.pdf
          Year 1917
     Citations 184
      Versions 3
    Cluster ID 17749203648027613321
      PDF link http://icole.mut-es.ac.ir/downloads/Sci_Sec/W1/Einstein%201917.pdf
Citations list http://scholar.google.com/scholar?cites=17749203648027613321&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=17749203648027613321&hl=en&as_sdt=0,5
       Excerpt The formal similarity between the chromatic distribution curve for thermal radiation [...]

Note the cluster ID in the above. Using this ID, you can directly access the cluster of articles Google Scholar has already determined to be variants of the same paper. So, let's see the versions:

$ scholar.py -C 17749203648027613321
         Title On the quantum theory of radiation
           URL http://icole.mut-es.ac.ir/downloads/Sci_Sec/W1/Einstein%201917.pdf
     Citations 184
      Versions 0
    Cluster ID 17749203648027613321
      PDF link http://icole.mut-es.ac.ir/downloads/Sci_Sec/W1/Einstein%201917.pdf
Citations list http://scholar.google.com/scholar?cites=17749203648027613321&as_sdt=2005&sciodt=0,5&hl=en
       Excerpt The formal similarity between the chromatic distribution curve for thermal radiation [...]

         Title ON THE QUANTUM THEORY OF RADIATION
           URL http://www.informationphilosopher.com/solutions/scientists/einstein/1917_Radiation.pdf
     Citations 0
      Versions 0
      PDF link http://www.informationphilosopher.com/solutions/scientists/einstein/1917_Radiation.pdf
       Excerpt The formal similarity between the chromatic distribution curve for thermal radiation [...]

         Title The Quantum Theory of Radiation
           URL http://web.ihep.su/dbserv/compas/src/einstein17/eng.pdf
     Citations 0
      Versions 0
      PDF link http://web.ihep.su/dbserv/compas/src/einstein17/eng.pdf
       Excerpt 1 on the assumption that there are discrete elements of energy, from which quantum [...]
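If you want to post-process this plain-text listing in a script, the records can be split on blank lines and the labels matched by prefix. The sketch below is a standalone Python 3 helper, not part of scholar.py; `LABELS` and `parse_records` are hypothetical names, and the label set is only what appears in the output shown above.

```python
# Labels seen in the output above, longest first so that
# "Citations list" is tried before "Citations".
LABELS = sorted(
    ("Title", "URL", "Year", "Citations", "Versions", "Cluster ID",
     "PDF link", "Citations list", "Versions list", "Excerpt"),
    key=len, reverse=True)


def parse_records(text):
    """Split label/value output into a list of dicts, one per article."""
    records, current = [], {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line ends the current article, if any.
            if current:
                records.append(current)
                current = {}
            continue
        for label in LABELS:
            if stripped.startswith(label + " "):
                current[label] = stripped[len(label):].strip()
                break
    if current:
        records.append(current)
    return records


sample = (
    "         Title On the quantum theory of radiation\n"
    "     Citations 184\n"
    "    Cluster ID 17749203648027613321\n"
    "\n"
    "         Title The Quantum Theory of Radiation\n"
    "     Citations 0\n"
)
articles = parse_records(sample)
```

Lines that match no known label, such as the shell prompt line, are ignored. All values are kept as strings.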

Let's retrieve a BibTeX entry for that quantum theory paper. The best BibTeX often seems to be the one linked from search results, not those in the article cluster, so let's do a search again:

$ scholar.py -c 1 --author "albert einstein" --phrase "quantum theory" --citation bt
@article{einstein1917quantum,
  title={On the quantum theory of radiation},
  author={Einstein, Albert},
  journal={Phys. Z},
  volume={18},
  pages={121--128},
  year={1917}
}
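An entry like this one can be read into a dictionary with a couple of regular expressions. This is only a sketch under a strong assumption: each field sits on its own line in the form `name={value}`, as in the entry above. `parse_flat_bibtex` is a hypothetical helper, not part of scholar.py, and it does not handle nested braces spanning lines, quoted values, or multiple entries.

```python
import re

ENTRY_HEAD = re.compile(r"@(\w+)\{([^,]+),")
FIELD = re.compile(r"^\s*(\w+)\s*=\s*\{(.*)\},?\s*$")


def parse_flat_bibtex(entry):
    """Parse a single flat BibTeX entry into a dict of strings."""
    head = ENTRY_HEAD.search(entry)
    if head is None:
        raise ValueError("not a BibTeX entry")
    # The entry type and citation key are stored under two extra keys.
    fields = {"type": head.group(1), "key": head.group(2)}
    for line in entry.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


entry = """@article{einstein1917quantum,
  title={On the quantum theory of radiation},
  author={Einstein, Albert},
  journal={Phys. Z},
  volume={18},
  pages={121--128},
  year={1917}
}"""
parsed = parse_flat_bibtex(entry)
```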

Report the total number of articles Google Scholar has for Einstein:

$ scholar.py --txt-globals --author "albert einstein" | grep '\[G\]' | grep Results
[G]    Results 4190
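The same filtering can be done in Python instead of grep. The sketch below assumes every global line looks like the one above: a `[G]` prefix, a name, and a value as the last whitespace-separated token. Only the `Results` line is confirmed by the example, so treat the handling of any other `[G]` line as a guess. `parse_globals` is a hypothetical helper, not part of scholar.py.

```python
def parse_globals(text):
    """Collect '[G]' lines from --txt-globals output into a dict."""
    result = {}
    for line in text.splitlines():
        if not line.startswith("[G]"):
            continue
        parts = line[3:].split()
        if len(parts) >= 2:
            # Assumption: the value is the last token, the rest is the name.
            result[" ".join(parts[:-1])] = parts[-1]
    return result


output = "[G]    Results 4190\n         Title On the quantum theory of radiation\n"
total = int(parse_globals(output)["Results"])
```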

ChangeLog

License

scholar.py uses the standard BSD license.