In this post I’m going to show a basic example of scraping a website using Python with the headless browser PhantomJS. In other words, I’m going to automate the process of extracting information from a website using a browser that doesn’t have or need a user interface.
- The easiest way to work with Python is to use virtual environments with virtualenv. On Linux (Debian in my case), run the following commands to install it.
    sudo apt-get install python-pip python-dev
    sudo pip install --upgrade pip
    sudo pip install --upgrade virtualenv
Now go to a directory of your choice, then create and activate the new virtual environment with the following commands.
    virtualenv --distribute venv
    source venv/bin/activate
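If you are not sure whether the environment is really active, a small check from Python can help. This is only a sketch: the function name `in_virtualenv` is my own, and it assumes that the interpreter exposes either `sys.real_prefix` (older virtualenv releases) or `sys.base_prefix` (newer virtualenv and the standard venv module).

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer tools instead make
    # sys.base_prefix point at the system interpreter.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


print("virtual environment active: %s" % in_virtualenv())
```

Run it with the python of the activated environment. If it prints False, the activate script was probably not sourced in the current shell.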
- First test. We need a couple of dependencies for the scraping: selenium and lxml. To install them, run the following commands inside our virtual environment:
    pip install selenium
    pip install lxml
If lxml fails to install, it is because some system dependencies are missing. In that case, first delete the virtual environment that you have just created.
    source ~/.bashrc
    rm -rf venv
After that, install the following dependencies, which are needed to compile the lxml module.
    sudo apt-get install libxml2-dev
    sudo apt-get install libxslt1-dev
    sudo apt-get install python-libxml2
    sudo apt-get install python-libxslt1
    sudo apt-get install python-setuptools
    sudo apt-get install zlib1g-dev
Once they are installed, create the virtual environment again as we did before, and install lxml inside it. If everything goes well, you should see something like Successfully installed lxml. Now we are going to test the following code:
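    import lxml.html as lh
    from selenium import webdriver

    # Open the page in a real Firefox window and grab the rendered HTML.
    browser = webdriver.Firefox()
    browser.get('http://commons.wikimedia.org/wiki/File%3aBrad_Pitt_Cannes_2011.jpg')
    content = browser.page_source
    browser.quit()

    # Parse the HTML and print the text of every span that contains
    # a link whose title includes "Use this file".
    doc = lh.fromstring(content)
    for elt in doc.xpath('//span[a[contains(@title,"Use this file")]]/text()'):
        print(elt)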
Save it to a file named “test1.py” and run it inside the virtual environment venv with python test1.py. You should get the following output:
    on the web
    on a wiki
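To see what the XPath expression selects without opening a browser, you can apply it to a small piece of HTML. The fragment below is hand-written for illustration and is only my assumption of the general shape of the markup, not a copy of the real Wikimedia page. The names `SAMPLE_HTML` and `extract_labels` are also made up for this sketch.

```python
import lxml.html as lh

# Hypothetical markup: each span holds a link followed by plain text.
SAMPLE_HTML = """
<html><body>
  <span><a href="#" title="Use this file on the web">Use this file</a> on the web</span>
  <span><a href="#" title="Use this file on a wiki">Use this file</a> on a wiki</span>
  <span><a href="#" title="Email a link">Email a link</a> to this file</span>
</body></html>
"""

# Same expression as in test1.py: the text nodes that sit directly inside a
# span, but only for spans with a child link whose title contains the phrase.
XPATH = '//span[a[contains(@title,"Use this file")]]/text()'


def extract_labels(html):
    """Return the stripped, non-empty text nodes matched by XPATH."""
    doc = lh.fromstring(html)
    return [text.strip() for text in doc.xpath(XPATH) if text.strip()]


print(extract_labels(SAMPLE_HTML))
```

The third span is skipped because the title of its link does not contain "Use this file". The text inside the links themselves is not returned either, since text() only matches text nodes that are direct children of the span.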
- Second test. First we are going to install PhantomJS following the instructions on phantomjs.org. We create a folder for it and run:
    wget https://phantomjs.googlecode.com/files/phantomjs-1.9.2-linux-x86_64.tar.bz2
    tar -xjvf phantomjs-1.9.2-linux-x86_64.tar.bz2
    sudo cp phantomjs-1.9.2-linux-x86_64/bin/phantomjs /usr/bin/
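Selenium has to find the phantomjs binary on the PATH, so it can be worth confirming that the copy to /usr/bin/ worked. Here is a minimal sketch using `shutil.which` from the standard library, which assumes Python 3.3 or later. The helper name `find_executable` is my own.

```python
import shutil


def find_executable(name):
    """Return the full path of an executable found on PATH, or None if absent."""
    return shutil.which(name)


path = find_executable("phantomjs")
if path is None:
    print("phantomjs was not found on PATH")
else:
    print("phantomjs found at %s" % path)
```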
Now we are going to modify the source code of the previous example by changing the instantiation of the browser, replacing Firefox() with PhantomJS(). In addition, I have specified a size for the browser window: if the website has a responsive design, you may be interested only in the data shown at one particular resolution.
    import lxml.html as lh
    from selenium import webdriver

    # Same as test1.py, but with the headless PhantomJS driver and a fixed window size.
    browser = webdriver.PhantomJS()
    browser.set_window_size(1024, 768)
    browser.get('http://commons.wikimedia.org/wiki/File%3aBrad_Pitt_Cannes_2011.jpg')
    content = browser.page_source
    browser.quit()

    doc = lh.fromstring(content)
    for elt in doc.xpath('//span[a[contains(@title,"Use this file")]]/text()'):
        print(elt)
Save the source code to a file named “test2.py” and run it inside the virtual environment with python test2.py. You should get the same output, but without a browser window being opened.
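One weak point of both scripts is that browser.quit() is never reached if browser.get() raises an error, which can leave a browser process running in the background. A possible way around this is to wrap the calls in try/finally. The function below is only a sketch: `fetch_page_source` and its parameters are names I made up, and the driver class is passed in as an argument so the same code works with webdriver.Firefox, webdriver.PhantomJS, or whichever driver your Selenium version provides.

```python
def fetch_page_source(browser_factory, url, window_size=None):
    """Open url in a freshly created browser and return the rendered HTML.

    browser_factory is any callable that returns a Selenium-style driver,
    that is, an object with get(), page_source and quit().
    """
    browser = browser_factory()
    try:
        if window_size is not None:
            width, height = window_size
            browser.set_window_size(width, height)
        browser.get(url)
        return browser.page_source
    finally:
        # Runs on success and on error, so no browser process is left behind.
        browser.quit()


# Hypothetical usage with the driver from test2.py:
#   from selenium import webdriver
#   url = 'http://commons.wikimedia.org/wiki/File%3aBrad_Pitt_Cannes_2011.jpg'
#   content = fetch_page_source(webdriver.PhantomJS, url, window_size=(1024, 768))
```

The returned string can be passed to lh.fromstring() exactly as before.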