Web scraping isn’t new. However, the technologies and techniques used to power websites are developing at a rapid pace. Many websites rely on front-end frameworks like Vue.js, React, or Angular that load some or all of their content asynchronously via JavaScript, so that content can only be fetched if the page is opened in a web browser. Web scraping can be done in virtually any programming language that has support for HTTP and XML or DOM parsing; in this tutorial, we will focus on web scraping using JavaScript in a Node.js server environment, so it assumes a working understanding of JavaScript and ES6/ES7 syntax.
Puppeteer is a library that lets you control the Chromium browser with code written in Node.js. The cool part is that web scraping with Puppeteer is easy and beginner friendly: even JavaScript beginners can start scraping the web with it, because the API is simple and straightforward.
Web scraping is the easiest way to automate the process of extracting data from any website. Puppeteer-based scrapers can be used when a normal request-based scraper is unable to extract data from a website.
What is Puppeteer?
Puppeteer is a Node.js library that provides a powerful but simple API for controlling Google’s Chrome or Chromium browser. It also allows you to run Chromium in headless mode (useful for running browsers on servers) and can send and receive requests without the need for a user interface; it works in the background, performing actions as instructed by the API. The developer community for Puppeteer is very active, and new updates are rolled out regularly. With its full-fledged API, it covers most actions that can be performed with a Chrome browser. As of now, it is one of the best options for scraping JavaScript-heavy websites.
What can you do with Puppeteer?
Puppeteer can do almost everything Google Chrome or Chromium can do.
- Click elements such as buttons, links, and images.
- Type like a user in input boxes and automate form submissions.
- Navigate between pages, follow links, and go back and forward.
- Take a timeline trace to find out where the issues are in a website.
- Carry out automated testing of user interfaces and various front-end apps, directly in a browser.
- Take screenshots and convert web pages to PDFs (see the sketch below).
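To give a taste of the API, here is a minimal sketch that loads a page, takes a screenshot, and saves the page as a PDF. The URL is just a placeholder:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com');         // placeholder URL
  await page.screenshot({ path: 'example.png' }); // save a screenshot
  await page.pdf({ path: 'example.pdf' });        // PDF export works in headless mode
  await browser.close();
})();
```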
Web Scraping using Puppeteer
In this tutorial, we’ll show you how to create a web scraper for Booking.com to scrape the details of hotel listings in a particular city from the first page of results. We will scrape the hotel name, rating, number of reviews, and price for each hotel listing.
Required Tools
To install Puppeteer you need to first install Node.js and write the code to control the browser (a.k.a. the scraper) in JavaScript. Node.js runs the script and lets you control the Chrome browser using the Puppeteer library. Puppeteer requires at least Node v7.6.0, but for this tutorial we will go with Node v9.0.0.
Installing Node.js
Linux
You can head over to NodeSource and choose the distribution you want. Here are the steps to install Node.js on Ubuntu 16.04:
1. Open a terminal and run sudo apt install curl, in case curl is not installed.
2. Then run curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
3. Once that’s done, install Node.js by running sudo apt install nodejs. This will automatically install npm.
Windows and Mac
To install Node.js on Windows or Mac, download the package for your OS from Node.js’s website: https://nodejs.org/en/download/
Obtaining the URL
Let’s start by obtaining the booking URL. Go to booking.com and search for a city with the inputs for check-in and check-out dates. Click the search button and copy the URL that has been generated. This will be your booking URL.
The gif below shows how to obtain the booking URL for hotels available in Singapore.
After you have completed the installation of Node.js, we will install the project requirements, which will also download the Puppeteer library used by the scraper. Download both files, app.js and package.json, from below and place them inside a folder. We have named our folder booking_scraper.
The script below is the scraper. We have named it app.js. This script will scrape the results for a single listing page:
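A minimal sketch of what app.js can look like is below. The CSS selectors (.sr_property_block, .sr-hotel__name, and so on) are assumptions based on Booking.com’s markup at the time of writing; the site changes its class names often, so verify them in your browser’s DevTools:

```javascript
const puppeteer = require('puppeteer');

const bookingUrl = ''; // line 3: paste the URL you copied from booking.com inside the quotes

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 926 });
  await page.goto(bookingUrl, { waitUntil: 'networkidle2' });

  // Pull the name, rating, review count, and price from every listing on the page.
  const hotels = await page.evaluate(() => {
    const text = (root, selector) => {
      const node = root.querySelector(selector);
      return node ? node.innerText.trim() : '';
    };
    return Array.from(document.querySelectorAll('.sr_property_block')).map(item => ({
      name: text(item, '.sr-hotel__name'),
      rating: text(item, '.bui-review-score__badge'),
      reviews: text(item, '.bui-review-score__text'),
      price: text(item, '.bui-price-display__value'),
    }));
  });

  console.log(JSON.stringify(hotels, null, 2));
  await browser.close();
})();
```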
The file below is package.json, which lists the libraries needed to run the scraper:
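A minimal version might look like this; the Puppeteer version number is illustrative:

```json
{
  "name": "booking_scraper",
  "version": "1.0.0",
  "description": "Scrapes hotel details from booking.com listing pages",
  "main": "app.js",
  "dependencies": {
    "puppeteer": "^1.0.0"
  }
}
```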
Install the project dependencies, which will also install Puppeteer:
- Go to the project directory and make sure it has the package.json file inside it.
- Run npm install to install the dependencies. This will also install Puppeteer and download the Chromium browser used to run the Puppeteer code. By default, Puppeteer works with the Chromium browser, but you can also use Chrome.
Now copy the URL that was generated from booking.com and paste it into the bookingUrl variable in the provided space (line 3 in app.js). Make sure the URL is inserted within quotes; otherwise, the script will not work.
Running the Puppeteer Scraper
To run a Node.js program, you type node followed by the script’s file name. For this script, that is:
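```bash
node app.js
```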
Turning off Headless Mode
The script above runs the browser in headless mode. To turn headless mode off, just modify this line:
const browser = await puppeteer.launch({ headless: true });
to
const browser = await puppeteer.launch({ headless: false });
You should then be able to see what is going on.
The program will run, fetch all the hotel details, and display them in the terminal. If you want to scrape another page, change the URL in the bookingUrl variable and run the program again.
Here is what the output for hotels in Singapore will look like:
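The exact hotels and prices depend on your dates and the live listings; with the sketch above, the output takes roughly this shape (placeholder values, not real results):

```json
[
  {
    "name": "Example Hotel Singapore",
    "rating": "8.5",
    "reviews": "1,234 reviews",
    "price": "S$ 150"
  }
]
```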
Debug Using Screenshots
In case you are stuck, you can always take a screenshot of the webpage and check whether you are being blocked or whether the structure of the website has changed. Here is something to get you started:
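For example, adding this line after page.goto() in app.js saves a full-page screenshot you can inspect:

```javascript
// Save a full-page screenshot to check for blocks or layout changes.
await page.screenshot({ path: 'booking_page.png', fullPage: true });
```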
Speed Up Puppeteer Web Scraping
Loading a web page with images can slow down web scraping because of the extra requests. Disabling CSS and images speeds up browsing and data scraping while also reducing bandwidth consumption.
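One common approach is Puppeteer’s request interception, which aborts image, stylesheet, and font requests before they are downloaded. Interception must be enabled before calling page.goto():

```javascript
// Enable request interception, then abort heavy resource types.
await page.setRequestInterception(true);
page.on('request', request => {
  if (['image', 'stylesheet', 'font'].includes(request.resourceType())) {
    request.abort(); // skip images, CSS, and fonts
  } else {
    request.continue();
  }
});
```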
Known Limitations
When using Puppeteer you should keep some things in mind. Since Puppeteer opens up a full browser, it takes a lot of memory and CPU to run in comparison to script-based scrapers that simply fetch and parse HTML.
If you want to scrape a simple website that does not use a JavaScript-heavy frontend, use a simple Python scraper. There are also plenty of open-source JavaScript web scraping tools you can try, such as the Apify SDK, Nodecrawler, Playwright, and more.
You may find Puppeteer to be a bit slow, as it only opens one page at a time and starts scraping a page only once it has fully loaded. Puppeteer scripts can only be written in JavaScript; no other languages are supported.
Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code; however, if you add your questions in the comments section, we may periodically address them.
Following up on my popular tutorial on how to create an easy web crawler in Node.js, I decided to extend the idea a bit further by scraping a few popular websites. For now, I'll just append the results of web scraping to a .txt file, but in a future post I'll show you how to insert them into a database.
Each scraper takes about 20 lines of code and they're pretty easy to modify if you want to scrape other elements of the site or web page.
Web Scraping Reddit
First I'll show you what it does and then explain it.
It first visits reddit.com and then collects all the post titles, the score, and the username of the user that submitted each post. It writes all of this to a .txt file named reddit.txt, separating each entry with a new line. Alternatively, it's easy to separate each entry with a comma or some other delimiter if you want to open the results in Excel or another spreadsheet program.
Okay, so how did I do it?
Make sure you have Node.js and npm installed. If you're not familiar with them take a look at the paragraph here.
Open up your command line. You'll need to install just two Node.js dependencies. You can do that by running the command shown below:
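Assuming the usual pairing for this kind of scraper, an HTTP client (request) and an HTML parser (cheerio):

```bash
npm install request cheerio
```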
Alternate option to install dependencies
Another option is copying over the dependencies and adding them to a package.json file and then running npm install. My package.json includes these:
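Something along these lines, with illustrative version numbers:

```json
{
  "name": "node-web-scrapers",
  "version": "1.0.0",
  "dependencies": {
    "cheerio": "^1.0.0-rc.2",
    "request": "^2.83.0"
  }
}
```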
The actual code to scrape reddit
Now to take a look at how I scraped reddit in about 20 lines of code. Open up your favorite text editor (I use Atom) and copy the following:
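A sketch of the scraper follows. The selectors (div.thing, a.title, .score.unvoted, a.author) are assumptions based on reddit's server-rendered markup, which today is served at old.reddit.com; verify them in DevTools before relying on them:

```javascript
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');

// old.reddit.com still serves the server-rendered markup these selectors assume.
request('https://old.reddit.com/', (error, response, html) => {
  if (error || response.statusCode !== 200) {
    return console.error('Request failed:', error || (response && response.statusCode));
  }
  const $ = cheerio.load(html);
  let output = '';

  // Each post is a div with the class "thing".
  $('div.thing').each((i, el) => {
    const title = $(el).find('a.title').text();
    const score = $(el).find('.score.unvoted').text();
    const user = $(el).find('a.author').text();
    output += title + '\n' + score + '\n' + user + '\n\n';
  });

  fs.writeFile('reddit.txt', output, err => {
    if (err) throw err;
    console.log('Saved the results to reddit.txt');
  });
});
```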
This is surprisingly simple. Save the file as scrape-reddit.js and then run it by typing node scrape-reddit.js. You should end up with a text file called reddit.txt in which each entry is the post title, then the score, and finally the username.
Web Scraping Hacker News
Let's take a look at how the posts are structured:
As you can see, there are a bunch of tr HTML elements with a class of athing. So the first step will be to gather up all of the tr.athing elements.
We'll then want to grab the post titles by selecting the td.title child element and then the a element (the anchor tag of the hyperlink).
Note that we skip over any hiring posts by making sure we only gather up the tr.athing elements that have a td.votelinks child.
Here's the code:
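A sketch under the same assumptions as the reddit scraper (request + cheerio, plus the tr.athing markup described above):

```javascript
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');

request('https://news.ycombinator.com/', (error, response, html) => {
  if (error || response.statusCode !== 200) {
    return console.error('Request failed:', error || (response && response.statusCode));
  }
  const $ = cheerio.load(html);
  let output = '';

  // Gather every tr.athing, skipping rows without a td.votelinks child
  // (that filters out the hiring posts, which have no vote arrows).
  $('tr.athing').each((i, el) => {
    if ($(el).find('td.votelinks').length === 0) return;
    const anchor = $(el).find('td.title a').first();
    output += anchor.text() + '\n' + anchor.attr('href') + '\n\n';
  });

  fs.writeFile('hackernews.txt', output, err => {
    if (err) throw err;
    console.log('Saved the results to hackernews.txt');
  });
});
```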
Run that and you'll get a hackernews.txt file: first you have the title of the post on Hacker News and then the URL of that post on the next line. If you wanted both the title and URL on the same line, you can change this line of the code:
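```javascript
output += anchor.text() + '\n' + anchor.attr('href') + '\n\n';
```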
to something like:
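```javascript
output += anchor.text() + ',' + anchor.attr('href') + '\n';
```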
This allows you to use a comma as a delimiter so you can open up the file in a spreadsheet like Excel or a different program. You may want to use a different delimiter, such as a semicolon, which is an easy change above.
Web Scraping BuzzFeed
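The BuzzFeed scraper follows the same request + cheerio pattern. BuzzFeed's markup changes frequently, so the h2 a selector here is a guess; inspect the homepage in DevTools and swap in whatever selector wraps each story link:

```javascript
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');

request('https://www.buzzfeed.com/', (error, response, html) => {
  if (error || response.statusCode !== 200) {
    return console.error('Request failed:', error || (response && response.statusCode));
  }
  const $ = cheerio.load(html);
  let output = '';

  // Grab each headline link; adjust the selector to the current markup.
  $('h2 a').each((i, el) => {
    const title = $(el).text().trim();
    const url = $(el).attr('href');
    if (title && url) output += title + '\n' + url + '\n\n';
  });

  fs.writeFile('buzzfeed.txt', output, err => {
    if (err) throw err;
    console.log('Saved the results to buzzfeed.txt');
  });
});
```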
Run that and you'll get each story's headline and URL in a buzzfeed.txt file.
Want more?
I'll eventually update this post to explain how the web scraper works. Specifically, I'll talk about how I chose the selectors to pull the correct content from the right HTML element. There are great tools that make this process very easy, such as Chrome DevTools, which I use while writing a web scraper for the first time.
I'll also show you how to iterate through the pages on each website to scrape even more content.
Finally, in a future post I'll detail how to insert these records into a database instead of a .txt file. Be sure to check back!
In the meantime, you may be interested in my tutorial on how to create a web crawler in Node.js / JavaScript.