Over the last few days I worked on a project where I used web scraping. Web scraping (web harvesting or web data extraction) is a software technique for extracting information from websites. This can be really helpful in one of the following scenarios:

  • You want to harvest, analyze, and print data coming from different websites into a single report or file
  • You need to extract some information from a website, but the website doesn’t provide an API or any other service
  • You want to automate some process on a specific website
  • You want to perform testing on your website

I decided to use the PhantomJS library to perform web scraping. PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.

Simply put, PhantomJS is a normal browser running the WebKit engine without a user interface. When you load a page using PhantomJS you can parse DOM elements, inject your own JavaScript code, extract cookies, send POST and GET requests, and so on.

In this post I will write about the scenario where you want to open website A and extract some content, then page B and extract other content, and so on. The second and more advanced scenario is when you want to be able to log in to a specific website using PhantomJS and extract some content.

How to install PhantomJS

Before we start with code examples, let me just say a few words about how to install PhantomJS. From the PhantomJS official website, download the library for your operating system and save it on your computer. If you are using Windows 7, just download the archive and extract the bin folder to your Desktop or any other folder. Inside the bin folder there is a phantomjs.exe file, which is the phantom executable.

In the same folder where you saved the .exe file, you will write your PhantomJS scripts. For testing purposes, create a new myScript.js file and copy and paste the following code:
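The original snippet was not included in the post, so here is a minimal test script of my own that simply prints a message and exits:

```javascript
// myScript.js - a minimal PhantomJS test script (my sketch; the original
// code was not included in the post)
console.log('Hello from PhantomJS!');

// Every PhantomJS script must call phantom.exit(), otherwise the
// process keeps running
phantom.exit();
```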

In order to run your PhantomJS script, open CMD, navigate to the folder where the phantomjs.exe file is saved, and execute the following command: phantomjs.exe myScript.js.

Web scraping with PhantomJS

In this scenario we will open website A and extract some content from it, then we will open website B and extract other content, and so on. I will start with an easy example:
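The example code itself was not included, so here is my sketch of what it might look like; the URL is an assumption based on the site name mentioned below:

```javascript
// Sketch: open a page and print its title (the URL is my assumption)
var page = require('webpage').create();

page.open('http://www.code-epicenter.com', function (status) {
    if (status === 'success') {
        // The webpage module exposes the loaded page's title directly
        console.log('Page title: ' + page.title);
    } else {
        console.log('Failed to load the page');
    }
    phantom.exit();
});
```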

The example above is extremely easy. We open Code Epicenter and extract the title from the page.

PhantomJS has one handy function called evaluate which enables you to inject your JavaScript code into a website and evaluate it. This can be helpful when you want to extract some information from a website, or when you want, for example, to submit a form. Here is an example:
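The original example was not included; the following sketch shows evaluate() filling in and submitting a form (the URL and element IDs are my assumptions):

```javascript
// Sketch: inject code into the page with evaluate(), e.g. to submit a
// form (URL and element IDs are placeholders, not from the post)
var page = require('webpage').create();

page.open('http://example.com/login', function (status) {
    page.evaluate(function () {
        // This function runs inside the page, with access to its DOM
        document.getElementById('username').value = 'myUser';
        document.getElementById('password').value = 'myPass';
        document.getElementById('loginForm').submit();
    });
});
```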

In the same way you can extract information inside the evaluate function and then return it back to the phantom script. For example:
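Again the original code is missing, so here is a sketch of returning data from the page context back to the phantom script:

```javascript
// Sketch: return data from evaluate() back to the phantom script
var page = require('webpage').create();

page.open('http://example.com', function (status) {
    var links = page.evaluate(function () {
        // Collect the href of every link on the page; the return value
        // must be JSON-serializable (see the note below)
        var anchors = document.getElementsByTagName('a');
        var result = [];
        for (var i = 0; i < anchors.length; i++) {
            result.push(anchors[i].href);
        }
        return result;
    });

    console.log('Found ' + links.length + ' links');
    phantom.exit();
});
```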

Please note that inside the evaluate() function you are not able to access phantom, page, or other phantom objects, because this code is executed in the page context, not the PhantomJS context.

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Sometimes it is really helpful to inject some data from the phantom script into the evaluate() function. For example, you may not want to hard-code data in the phantom script itself, because you want to read it from a file or pass it as a system argument. If you need this, just copy the following function to your script:
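The helper itself was not included in the post; a sketch of such a function, which serializes the arguments and builds a string for evaluate(), might look like this (note that newer PhantomJS versions also support page.evaluate(fn, arg1, arg2, ...) directly):

```javascript
// Sketch of a helper that passes arguments from the phantom script into
// the page context (the name evaluateWithArgs is my assumption)
function evaluateWithArgs(page, fn) {
    // Everything after the first two parameters is treated as an
    // argument for fn; arguments must be JSON-serializable
    var args = [].slice.call(arguments, 2);
    var fnStr = 'function() { return (' + fn.toString() +
                ').apply(this, ' + JSON.stringify(args) + '); }';
    // page.evaluate also accepts a string containing a function
    return page.evaluate(fnStr);
}
```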

Now you can do something like this:
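A usage sketch, assuming a helper of this kind named evaluateWithArgs (the selector is a placeholder):

```javascript
// Sketch: pass a CSS selector from the phantom script into the page
var text = evaluateWithArgs(page, function (selector) {
    return document.querySelector(selector).textContent;
}, '#content');

console.log(text);
```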

So far, we have learned how to open a page and extract some content from it. But we want more: we want to open page A, then B, then C, and so on.

As I said at the beginning, PhantomJS is just a normal browser, which means we have to wait for the page to load, just like in a normal browser. Keep this in mind, because sometimes you can get empty content simply because you didn’t wait for the page to load.
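One simple way to respect this, sketched below, is to do all extraction inside the page.open callback, which fires only after the load finishes (the URL is a placeholder):

```javascript
// Sketch: extract content only after the load has finished
var page = require('webpage').create();

page.open('http://example.com', function (status) {
    if (status !== 'success') {
        console.log('Load failed');
        phantom.exit(1);
    }
    // It is now safe to read the page content
    console.log(page.content.length + ' bytes loaded');
    phantom.exit();
});
```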

PhantomJS modules

Before we continue, let’s talk about PhantomJS modules. A module is just a piece of code specialized to do something. PhantomJS provides several modules:

  • Command Line Interface
  • phantom Object
  • Web Page Module
  • Child Process Module
  • File System Module
  • System Module
  • Web Server Module

To use a module inside a PhantomJS script, use the require function:
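The line in question was not included in the post; importing the webpage module and creating an instance in one go looks like this:

```javascript
// Import the webpage module and create a page instance in one line
var page = require('webpage').create();
```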

The line above imports the webpage module and then creates an instance of it. We could also write it like this:
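For example, the same thing split into two steps:

```javascript
// Import the webpage module first, then create an instance
var webpage = require('webpage');
var page = webpage.create();
```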

The question is how to create our own modules, and then how to reuse them across several projects. For example, I created a Request module, which helps me send GET requests to a website. Here is the code:
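The module's code was not included in the post, so the following is only a sketch of what such a module might look like, using PhantomJS's CommonJS-style exports (the file name DRequest.js and the details are my assumptions):

```javascript
// DRequest.js - a sketch of a GET-request module (the actual
// implementation was not included in the post)
var webpage = require('webpage');

exports.get = function (url, callback) {
    var page = webpage.create();
    page.open(url, function (status) {
        // status is either 'success' or 'fail'; hand the loaded page
        // back to the caller
        callback(status, page);
    });
};
```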

In order to import this module into your project, just write this line in your phantom script:
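Assuming the module file sits next to the main script and is named DRequest.js (my assumption), the import would be:

```javascript
// Import a local module by relative path
var request = require('./DRequest');
```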

Using modules can be very handy when you want to create your own cookie management or file management system which has to be specific to your application.

Final solution

Now let me reorganize the code and paste the final solution. In the final solution I used my DRequest module to send GET requests to websites. I also added a small piece of code which ensures that the website is fully loaded (we do not want partial content).
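The final script itself was not included in the post. Below is my self-contained sketch of the steps-based pattern it describes, following the well-known Stack Overflow approach the post credits; I use the standard webpage module here instead of the author's DRequest module, and the URL and file name are placeholders:

```javascript
// Sketch of the steps-based final script (my reconstruction, not the
// author's original code)
var page = require('webpage').create();
var fs = require('fs');

var steps = [
    function () {
        // Step 1: open the website (placeholder URL)
        page.open('http://example.com');
    },
    function () {
        // Step 2: save the fully loaded page to a file
        fs.write('output.html', page.content, 'w');
    }
];

var currentStep = 0;
var loadInProgress = false;

page.onLoadStarted = function () { loadInProgress = true; };
page.onLoadFinished = function () { loadInProgress = false; };

setInterval(function () {
    // Run the next step only when no page load is in progress
    if (!loadInProgress && typeof steps[currentStep] === 'function') {
        steps[currentStep]();
        currentStep++;
    }
    // All steps done: exit
    if (typeof steps[currentStep] !== 'function' && !loadInProgress) {
        phantom.exit();
    }
}, 50);
```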

You are asking yourself: what is this? Let me try to explain. In the steps[] array, you define the actions to perform. Actions will be performed one by one, starting from the first action, which is exactly what we want. In our example, the first action is to open the website. The phantom script will not go to the second step until the page is fully loaded, which is important, because we don’t want partially loaded websites. When the website is fully loaded, step 2 is executed. In step 2 we just save the loaded website to a file.

Now, if you want to add your own specific steps, you just have to add new elements to the steps[] array. Here is an example:
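The extended array was not included in the post; a sketch of a four-step version, with placeholder URLs and file names, might look like this:

```javascript
// Sketch: extend the steps[] array to scrape two sites in a row
// (URLs and file names are placeholders)
var steps = [
    function () { page.open('http://site-a.example'); },        // step 1
    function () { fs.write('siteA.html', page.content, 'w'); }, // step 2
    function () { page.open('http://site-b.example'); },        // step 3
    function () { fs.write('siteB.html', page.content, 'w'); }  // step 4
];
```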

Now we have four steps instead of just two. Here are the steps:

  1. Get http://photo-epicenter
  2. Save website to file
  3. Get
  4. Save website to file

Note: The final script relies on a Stack Overflow answer.
