Web Scraping with CasperJS – Handling Pagination
Intro
Web scraping is a very popular technique today, but it comes with several problems that you have to handle properly. In this post I will explain how to handle pagination when working with CasperJS.
If you have just started with screen scraping, read “Why CasperJS is better than PhantomJS”. There are also a few good working examples, such as “How to login Amazon using CasperJS”, “How to login Amazon using PhantomJS” and “How to login Facebook using CasperJS”.
Problem
When you are dealing with web scraping, you have to handle a large set of problems: pages behind login screens, cookies, AJAX requests, file downloads and many others, including pagination. Pagination is present on almost every website today, and that makes our life harder.
The problem with pagination is that page numbers are generated dynamically: we don’t know in advance how many pages there will be, and we have to handle them all the same way. This is a perfect use case for recursion, and that is what we are going to implement in order to process all pages on a given website.
Recursion in computer science is a method where the solution to a problem depends on solutions to smaller instances of the same problem (as opposed to iteration). The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.
One important thing about recursion is to define the condition under which the recursion stops; otherwise the recursion will be infinite.
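To make the stop condition concrete, here is a minimal sketch in plain JavaScript (no CasperJS involved). The names countPages, threePages and brokenSite are made up for this illustration. It assumes each page object simply points to the next one, and it adds a safety cap for sites where the Next link never goes away.

```javascript
// Counts pages by following "next" links recursively.
function countPages(page, visited, maxPages) {
    if (page === null) { return visited; }       // stop condition: there is no next page
    if (visited >= maxPages) { return visited; } // safety cap against infinite recursion
    return countPages(page.next, visited + 1, maxPages);
}

// Hypothetical data: a chain of three pages.
var threePages = { next: { next: { next: null } } };

// Hypothetical broken site: "next" always points back to the same page.
var brokenSite = { next: null };
brokenSite.next = brokenSite;

console.log(countPages(threePages, 0, 100)); // 3
console.log(countPages(brokenSite, 0, 100)); // 100, stopped by the cap
```

Without the second check, the call on brokenSite would never reach the stop condition.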
Resolution to the problem
Here is one example of pagination:
When you open a page where pagination is implemented, the first page is usually selected, and there is usually a Next button which takes you to the following page. When the last page is selected, the Next button is usually not displayed, and that is a perfect condition for stopping the recursion. Please note that pagination elements differ between websites, so the first task is to see how the pagination works and what happens when you hit the last page (sometimes the Next button is visible but not clickable, so testing its CSS values can be the condition for stopping the recursion).
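A last-page check that covers both cases (button missing, or button present but inactive) might look like the sketch below. The function name isLastPage is made up, and the "disabled" CSS class is only an assumption about how a site could mark an inactive button; inspect the real markup first. The document is passed in as a parameter so the function can be tried outside a browser. In page context you would pass the global document.

```javascript
// Returns true when the pagination has no usable Next button.
function isLastPage(doc, nextSelector) {
    var next = doc.querySelector(nextSelector);
    if (next === null) { return true; }          // button is not rendered at all
    if (next.disabled === true) { return true; } // e.g. <button disabled>
    // Assumption: an inactive button carries a "disabled" CSS class.
    return /(^|\s)disabled(\s|$)/.test(next.className || "");
}

// Tiny stand-in for a document, only for trying the function out.
function fakeDocument(button) {
    return { querySelector: function () { return button; } };
}

console.log(isLastPage(fakeDocument(null), "#nextButton"));                          // true
console.log(isLastPage(fakeDocument({ className: "btn disabled" }), "#nextButton")); // true
console.log(isLastPage(fakeDocument({ className: "btn" }), "#nextButton"));          // false
```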
Let us try to explain our problem graphically. The algorithm flow is displayed in the image below:
As you can see, when the first page is opened we extract its content and then test whether we have hit the last page. If we have, we finish the script. If we have not, we open the next page and repeat the same steps.
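Before bringing CasperJS in, the flow described above can be sketched in plain JavaScript against a made-up in-memory "site". Everything here (the site object, openNextPage, the page contents) is hypothetical and only mirrors the steps of the algorithm.

```javascript
// Hypothetical stand-in for a paginated website.
var site = {
    pages: ["Page 1 content", "Page 2 content", "Page 3 content"],
    current: 0
};

function getPageData() { return site.pages[site.current]; }
function isTheLastPage() { return site.current === site.pages.length - 1; }
function openNextPage() { site.current += 1; }

function processPage(collected) {
    collected.push(getPageData());             // 1. extract the content
    if (isTheLastPage()) { return collected; } // 2. last page reached: finish
    openNextPage();                            // 3. otherwise open the next page
    return processPage(collected);             // 4. and do the same steps again
}

var all = processPage([]);
console.log(all.length); // 3
```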
Now we can identify several functions which will be used in the implementation:
function getPageData() // In this function you will write the code for extracting the content
function terminate()   // Function which will terminate our CasperJS script
function processPage() // CasperJS function responsible for calling getPageData(), checking whether the last page was reached, and moving on to the next page
Now let us define and implement every function.
function getPageData() {
    /*
     In this function you can put anything you want in order to extract your data from the website.
     NOTE: This function is executed in page context, and should be passed as a parameter to Casper's evaluate function.
    */
    return document.title; // For the sake of simplicity I will just extract the website title and return it.
}
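If you need more than the title, a variant of getPageData() could collect the text of every matching element. This is only a sketch: the ".item" selector is a placeholder, not taken from any real site. Because the function runs in page context, it should return plain, serializable values (strings, numbers, arrays, simple objects) rather than DOM nodes.

```javascript
function getPageData() {
    // ".item" is a placeholder selector; replace it with one that matches your page.
    var nodes = document.querySelectorAll(".item");
    // Convert the node list into a plain array of trimmed strings.
    return Array.prototype.map.call(nodes, function (node) {
        return node.textContent.trim();
    });
}
```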
var terminate = function() {
    this.echo("STOPPING SCRIPT").exit();
};
Here is our processPage() function implementation.
var processPage = function() {
    // getPageData is your function which does the data scraping from the page.
    // If you need to extract data from tables or divs, write your logic in that function.
    var pageData = this.evaluate(getPageData);
    this.echo(pageData); // Do something with the data; here we just print it

    // If there is no nextButton on the page, exit the script because we hit the last page
    if (!this.exists("#nextButton")) {
        return terminate.call(casper);
    }

    // If the script didn't finish, click on the next button and process the next page
    this.thenClick("#nextButton").then(function() {
        this.waitForSelector("#content", processPage, terminate);
    });
};
Final solution
Here is the final code which handles pagination.
var casper = require("casper").create({
    pageSettings: {
        userAgent: "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36"
    }
});

var url = 'Your URL'; // Type your URL

// The callbacks are defined before they are passed to waitForSelector();
// otherwise the variables would still be undefined at that point.
var stopScript = function() {
    casper.echo("STOPPING SCRIPT").exit();
};

function getPageData() {
    /*
     In this function you can put anything you want in order to extract your data from the website.
     NOTE: This function is executed in page context, and should be passed as a parameter to Casper's evaluate function.
    */
    return document.title; // For the sake of simplicity I will just extract the website title and return it.
}

var processPage = function() {
    // getPageData is your function which does the data scraping from the page.
    var pageData = this.evaluate(getPageData);
    this.echo(pageData); // Do something with the data; here we just print it

    // If there is no nextButton on the page, stop the script because we hit the last page
    if (!this.exists("#nextButton")) {
        return stopScript();
    }

    // Click on the next button, wait for the content and process the next page
    this.thenClick("#nextButton").then(function() {
        this.waitForSelector("#content", processPage, stopScript);
    });
};

casper.start(url); // Start CasperJS
casper.waitForSelector('#content', processPage, stopScript); // Wait until content loads and then process the page
casper.run();
I intentionally didn’t write code for a specific website because I wanted to share the general idea of web scraping when you have to handle pagination. If you copy and paste this example, you can easily adjust it to your needs with really small changes in the getPageData() function and in the condition for terminating the script.
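As one possible adjustment, the sketch below keeps the data from every page in an array and adds a page limit as a second stop condition. The names results, MAX_PAGES and finish are my own and not part of CasperJS. The selectors are the placeholders used throughout this post. The functions are meant to take the place of processPage and stopScript in a script like the one above.

```javascript
var results = [];   // data collected from every processed page
var MAX_PAGES = 50; // hypothetical safety limit, adjust to your needs

function getPageData() {
    return document.title; // executed in page context
}

function finish() {
    this.echo("Collected " + results.length + " page(s)").exit();
}

function processPage() {
    results.push(this.evaluate(getPageData));

    // Stop when the limit is reached or when there is no Next button anymore
    if (results.length >= MAX_PAGES || !this.exists("#nextButton")) {
        return finish.call(this);
    }

    this.thenClick("#nextButton").then(function () {
        this.waitForSelector("#content", processPage, finish);
    });
}
```

The page limit protects you on websites where the Next button never disappears.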
Important note: I have seen some implementations that use for loops, but that is a bad approach because of the asynchronous nature of JavaScript. It can give you wrong and unwanted results.
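One way to see the difference is the small simulation below. It is not CasperJS code: createQueue is a made-up, minimal stand-in for a queue of steps that are recorded first and executed later, and loadPage and TOTAL_PAGES are hypothetical as well. A loop has to queue all its steps before the first page has been seen, so the number of pages must be guessed. The recursive version queues the next step only after the current page has been processed.

```javascript
// Minimal stand-in for a step queue: then() only records a step,
// and nothing is executed until run() is called.
function createQueue() {
    var steps = [];
    return {
        then: function (step) { steps.push(step); },
        run: function () {
            // A running step may add further steps to the queue.
            while (steps.length > 0) { steps.shift()(); }
        }
    };
}

var TOTAL_PAGES = 3; // hypothetical site; the scraper does not know this number

// Returns the page content, or null when the page does not exist.
function loadPage(number) {
    return number <= TOTAL_PAGES ? "page " + number : null;
}

// Loop version: every step is queued up front, based on a guess.
function scrapeWithLoop(guessedPages) {
    var queue = createQueue();
    var visited = [];
    for (var i = 1; i <= guessedPages; i++) {
        (function (number) {
            queue.then(function () { visited.push(loadPage(number)); });
        })(i);
    }
    queue.run();
    return visited;
}

// Recursive version: the next step is queued only when another page exists.
function scrapeWithRecursion() {
    var queue = createQueue();
    var visited = [];
    function processPage(number) {
        visited.push(loadPage(number));
        if (loadPage(number + 1) === null) { return; } // stop condition
        queue.then(function () { processPage(number + 1); });
    }
    queue.then(function () { processPage(1); });
    queue.run();
    return visited;
}

console.log(scrapeWithLoop(5));     // the last two entries are null: the guess was too high
console.log(scrapeWithRecursion()); // exactly the three existing pages
```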
About the author: Amir Duran is a software engineer who currently lives and works in Germany. He obtained a Master's degree at the Faculty of Electrical Engineering in Sarajevo, Department of Computer Science, and specializes in designing and implementing full-stack web applications.
Originally published on Code Epicenter: http://code-epicenter.com/web-scraping-with-casperjs-handling-pagination/
Looks good Amir.
But can’t make it work with either phantomjs version 1.9.8 & CasperJS version 1.1.3 (self tests OK) or phantomjs version 2.1.1 & CasperJS version 1.1.3 (self tests with some errors).
Looks like issue might be with waitFor() which times out even though the wait on element exists, so that the processPage function is not triggered.
Be interested in the versions you used for your example.
Wooww…
Thank you so much, dude!
Nice work!