Scraping a website using Python, Selenium, lxml and PhantomJS

In this post I’m going to show a basic example of scraping a website using Python and the headless browser PhantomJS. In other words, I’m going to automate the process of extracting information from a website using a browser that doesn’t have or need a user interface.

  • The easiest way to work with Python is to use virtual environments with virtualenv. On Linux (Debian in my case), run the following commands to install it.

    Now go to a directory of your choice, then create and activate a new virtual environment with the following commands.
  • First test. We need a couple of dependencies for the scraping: selenium and lxml. To install them, run  pip install selenium and  pip install lxml inside the virtual environment. If installing lxml fails, it is probably because some system dependencies are missing. In that case, delete the virtual environment you have just created.

    After that, install the following dependencies; they are necessary to compile the lxml module. lxml builds against the C libraries libxml2 and libxslt, so their development packages must be present.

    Once they are installed, create the virtual environment again as before and install lxml inside it. If everything goes well, you should see something like  Successfully installed lxml . Now we are going to test the following code:

    Save it to a file named “test1.py” and run it inside the virtual environment venv with  python test1.py. You should get the following output:
  • Second test. First we install PhantomJS following the instructions at phantomjs.org. We create a folder for it and run

    Now we modify the source code of the previous example, changing the instantiation of the browser from Firefox() to PhantomJS(). I have also specified a size for the browser window: if the website has a responsive design, you may be interested only in the data shown at one resolution.

    Save the source code to a file named “test2.py” and run it inside the virtual environment with  python test2.py. You should get the same output, but without a browser window opening.
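Before running either test it can help to confirm that both packages are really importable from inside the virtual environment. The following is a minimal sketch using only the standard library; the helper name missing_modules and the idea of saving it as a separate check script are assumptions for illustration, not part of the original setup.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["selenium", "lxml"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("selenium and lxml are both importable")
```

Run it with the virtual environment activated. If it reports a missing package, the pip install step above probably did not complete inside that environment.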

 
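As a rough illustration of what the two tests describe, here is a minimal sketch that loads a page with Selenium and parses the rendered HTML with lxml. It is an assumption about the general shape of such a script, not the original test1.py or test2.py: the function names, the default window size, the placeholder URL and the choice to extract link texts are all hypothetical. Note also that webdriver.PhantomJS exists only in older Selenium releases (it was deprecated in the 3.x series and removed in Selenium 4), so the sketch uses Firefox and marks where the swap would go.

```python
from lxml import html  # third-party: pip install lxml


def extract_link_texts(page_source):
    """Return the stripped text of every <a> element in an HTML string."""
    tree = html.fromstring(page_source)
    return [a.text_content().strip() for a in tree.xpath("//a")]


def fetch_page_source(url, width=1024, height=768):
    """Load url in a Selenium-driven browser and return the rendered HTML."""
    # Imported lazily so the parsing helper works without a browser installed
    from selenium import webdriver

    # Second test: replace with webdriver.PhantomJS() on Selenium
    # releases that still ship that driver.
    browser = webdriver.Firefox()
    try:
        # A fixed size matters for responsive sites (hypothetical default)
        browser.set_window_size(width, height)
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even on errors


# Example usage (needs Firefox and its driver; placeholder URL):
#   source = fetch_page_source("https://example.com/")
#   print(extract_link_texts(source))
```

Keeping the parsing in its own function means the lxml part can be tried on a saved HTML string first, without starting any browser.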
