Intro

Web scraping is a popular technique today, but it comes with several problems that have to be handled properly. In this post I will explain how to handle pagination when working with CasperJS.

If you have just started with screen scraping, read “Why CasperJS is better than PhantomJS”. There are also a few good working examples, such as “How to login Amazon using CasperJS”, “How to login Amazon using PhantomJS” and “How to login Facebook using CasperJS”.

Problem

When you are dealing with web scraping, you have to handle a large set of problems, such as pages behind login screens, cookies, AJAX requests, file downloads, and many others, including pagination. Pagination is present on almost every website today, and that makes our life harder.

The problem with pagination is that page numbers are generated dynamically: we don’t know how many pages will be generated, and we have to handle them all the same way. This is a perfect fit for recursion, and that is what we are going to implement in order to process all pages on a given website.

Recursion in computer science is a method where the solution to a problem depends on solutions to smaller instances of the same problem (as opposed to iteration). The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

One important thing about recursion is to define the condition under which the recursion stops; otherwise the recursion will be infinite.
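To make the stopping condition concrete, here is a minimal sketch in plain JavaScript. The `countPages` function and the nested `pages` object are hypothetical and exist only for illustration; they stand in for a chain of pages linked by a "next" reference.

```javascript
// Hypothetical example: count the pages in a chain by following "next" links.
function countPages(page) {
    // Stopping condition: there is no page left, so the recursion ends here.
    if (page === null) {
        return 0;
    }
    // Recursive step: this page plus however many pages follow it.
    return 1 + countPages(page.next);
}

// Three made-up pages linked together; the last one has no successor.
var pages = { next: { next: { next: null } } };

console.log(countPages(pages)); // 3
```

Without the `page === null` check, the function would keep calling itself until the JavaScript engine ran out of stack space.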

Resolution to the problem

Here is one example of pagination:

[Image: Pagination]

When you open a page where pagination is implemented, the first page is usually selected. Usually there is also a Next button, which increments the page number. When the last page is selected, the Next button is usually not displayed, and that is a perfect condition for stopping the recursion. Please note that pagination elements differ between websites, so the first task is to see how the pagination works and what happens when you hit the last page (sometimes the Next button is visible but not clickable, so testing CSS values can be the condition for stopping the recursion in that case).

Let us try to explain our problem graphically. The algorithm flow is displayed in the image below:
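Both variants of the stopping condition can be sketched as a single check. The selector `a.next` and the class name `disabled` are assumptions and must be replaced with whatever the target website actually uses. The `casper` instance is passed in as a parameter, a choice made here so the function is easy to try out with a stub.

```javascript
// Assumed selector and class name; inspect the target site to find the real ones.
var NEXT_SELECTOR = 'a.next';
var DISABLED_CLASS = 'disabled';

// Returns true when there is no usable Next button on the current page.
function isLastPage(casper) {
    // Case 1: the Next button is missing or hidden on the last page.
    if (!casper.exists(NEXT_SELECTOR) || !casper.visible(NEXT_SELECTOR)) {
        return true;
    }
    // Case 2: the button is shown but marked as not clickable via a CSS class.
    var classes = casper.getElementAttribute(NEXT_SELECTOR, 'class') || '';
    return classes.split(/\s+/).indexOf(DISABLED_CLASS) !== -1;
}
```

Some sites signal the last page differently (for example with a `disabled` attribute or an inline style), in which case the second check would need to be adapted.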

[Image: CasperJS - web scraping pagination]

As you can see, when the first page is opened, we extract its content and then test whether we have hit the last page. If we have, we finish the script. If we have not, we open the next page and repeat the same steps.

Now we can identify the functions that will be used in the implementation, among them getPageData() and processPage().

Now let us define and implement each function.

Here is our processPage() function implementation.
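A minimal sketch of how processPage() could look, together with a simple getPageData(), is given below. The selectors `.item` and `a.next` are placeholders, and passing `casper` and a `results` array as parameters is an assumption made here so the functions can be exercised in isolation.

```javascript
// Assumed selectors; adjust them for the site being scraped.
var ITEM_SELECTOR = '.item';
var NEXT_SELECTOR = 'a.next';

// Extracts the text of every matching element from the currently loaded page.
function getPageData(casper) {
    // The callback runs inside the page, so it can use the DOM directly.
    return casper.evaluate(function (selector) {
        var nodes = document.querySelectorAll(selector);
        return Array.prototype.map.call(nodes, function (node) {
            return node.textContent.trim();
        });
    }, ITEM_SELECTOR);
}

// Processes the current page, then schedules itself for the next one.
function processPage(casper, results) {
    // Step 1: extract content from the current page.
    results.push.apply(results, getPageData(casper) || []);

    // Step 2: stopping condition, no Next button means this was the last page.
    if (!casper.exists(NEXT_SELECTOR)) {
        return;
    }

    // Step 3: open the next page and repeat the same steps.
    casper.thenClick(NEXT_SELECTOR);
    casper.then(function () {
        processPage(casper, results);
    });
}
```

The recursion here goes through CasperJS's step queue: each call adds the next click and the next processPage() call as new steps, so pages are handled strictly one after another. If the site loads the next page via AJAX rather than a full navigation, a wait such as `casper.waitForSelector` would probably be needed between the click and the next call.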

Final solution

Here is the final code that handles pagination.

I intentionally didn’t write the code for a specific website, because I wanted to share the general idea of web scraping when you have to handle pagination. If you copy and paste this example, you can easily adjust it to your needs with really small changes to the getPageData() function and to the condition for terminating the script.

Important note: I have seen some implementations that use for loops, but that is a bad approach because of the asynchronous nature of JavaScript. It can give you wrong and unwanted results.
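One way the complete script could be put together is sketched below. The wrapper function `scrapeAllPages`, the selectors, and the URL `http://example.com/list` are all placeholders chosen for illustration, and the CasperJS wiring is shown in comments because it only runs under CasperJS itself.

```javascript
// Sketch of a complete pagination scraper. Selectors are assumptions.
function scrapeAllPages(casper, startUrl, onDone) {
    var ITEM_SELECTOR = '.item';   // assumed selector for the data to extract
    var NEXT_SELECTOR = 'a.next';  // assumed selector for the Next button
    var results = [];

    // Collects the text of all matching elements on the current page.
    function getPageData() {
        return casper.evaluate(function (selector) {
            return Array.prototype.map.call(
                document.querySelectorAll(selector),
                function (node) { return node.textContent.trim(); }
            );
        }, ITEM_SELECTOR) || [];
    }

    // Extracts data, then either stops or queues the next page.
    function processPage() {
        results = results.concat(getPageData());
        if (!casper.exists(NEXT_SELECTOR)) {
            return; // last page: no new steps are queued, so run() can finish
        }
        casper.thenClick(NEXT_SELECTOR);
        casper.then(processPage);
    }

    casper.start(startUrl, processPage);
    casper.run(function () {
        onDone(results);
    });
}

// Wiring inside a real CasperJS script:
// var casper = require('casper').create();
// scrapeAllPages(casper, 'http://example.com/list', function (results) {
//     casper.echo(JSON.stringify(results, null, 2));
//     casper.exit();
// });
```

To adapt it, change getPageData() to return whatever the site provides and replace the `casper.exists` check with the stopping condition that matches the site's pagination.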
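One way a for loop can go wrong is shown in the plain JavaScript sketch below. The `then` function and `steps` array are a made-up stand-in for a step queue like the one CasperJS uses: callbacks are registered immediately but executed later.

```javascript
// A tiny stand-in for a step queue: callbacks are stored now and run later.
var steps = [];
function then(fn) {
    steps.push(fn);
}

var visited = [];
for (var page = 1; page <= 3; page++) {
    then(function () {
        // This runs only after the loop has finished, when page is already 4.
        visited.push(page);
    });
}

// The queued steps execute after the loop is done.
steps.forEach(function (fn) { fn(); });

console.log(visited); // [ 4, 4, 4 ]
```

Every callback sees the final value of `page` instead of the value it had when the step was queued. A for loop also needs the number of pages up front, which is exactly what is unknown with pagination.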

Web Scraping with CasperJS - Handling Pagination
Author: Amir Duran
Categories: JavaScript, Libraries, Programming, Tutorials
Tags: CasperJS, pagination, PhantomJS, Web scraping