What Is Web Scraping?


Web scraping is a technique for extracting data from websites, typically ones that do not offer an API. A scraper collects the data automatically and presents it in a format that is easy to make sense of.

This tutorial will show you how to scrape websites with Ruby and Headless Chrome, using Selenium WebDriver.

Prerequisites

Selenium WebDriver

Open the command prompt on your operating system and execute the command below.

gem install selenium-webdriver

(If you already have the gem installed, make sure you have version 3.4.1 or higher.)
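One way to check the installed version from Ruby, rather than by eye, is sketched below. The helper names meets_minimum? and installed_selenium_version are made up for this illustration. Gem::Version ships with RubyGems and compares versions segment by segment, so 3.14.0 counts as newer than 3.4.1, which a plain string comparison would get wrong.

```ruby
require 'rubygems'

# Minimum selenium-webdriver version mentioned above.
MINIMUM_SELENIUM_VERSION = '3.4.1'

# Hypothetical helper: true if `installed` is at least `minimum`.
# Gem::Version compares segment by segment, so '3.14.0' > '3.4.1'.
def meets_minimum?(installed, minimum = MINIMUM_SELENIUM_VERSION)
  Gem::Version.new(installed) >= Gem::Version.new(minimum)
end

# Hypothetical helper: newest installed version of the gem as a string,
# or nil if the gem is not installed.
def installed_selenium_version
  specs = Gem::Specification.find_all_by_name('selenium-webdriver')
  specs.map(&:version).max&.to_s
end

version = installed_selenium_version
if version.nil?
  puts 'selenium-webdriver is not installed'
elsif meets_minimum?(version)
  puts "selenium-webdriver #{version} is new enough"
else
  puts "selenium-webdriver #{version} is older than #{MINIMUM_SELENIUM_VERSION}"
end
```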

ChromeDriver

Download the latest version of ChromeDriver.

Unzip ChromeDriver and add the executable to /usr/local/bin.

Or, if you are using Homebrew:

brew install chromedriver

Below is a list of the finders available with Ruby and Selenium WebDriver. For the examples, I will mostly be using :class, but any of the finders can be used if they are present in the HTML of the website you are scraping.

FINDERS = {
  :class             => 'ClassName',
  :class_name        => 'ClassName',
  :css               => 'CssSelector',
  :id                => 'Id',
  :link              => 'LinkText',
  :link_text         => 'LinkText',
  :name              => 'Name',
  :partial_link_text => 'PartialLinkText',
  :tag_name          => 'TagName',
  :xpath             => 'Xpath'
}
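
Because every finder is passed to the driver in the same way, a script can try several of them in turn. The sketch below assumes a hypothetical helper, first_match, and relies on #find_elements returning an empty array when nothing matches. The selectors in the usage comment are placeholders, not taken from a real page.

```ruby
# Hypothetical helper: try each [finder, value] pair in order and return the
# first element found, or nil if none of them match.
# `driver` is anything that responds to #find_elements(how, what), such as a
# Selenium::WebDriver driver.
def first_match(driver, candidates)
  candidates.each do |how, what|
    found = driver.find_elements(how, what)
    return found.first unless found.empty?
  end
  nil
end

# Example call (placeholder selectors):
#   headline = first_match(driver, [[:id, 'headline'], [:css, 'h1.title'], [:tag_name, 'h1']])
#   puts headline.text.strip if headline
```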

Extracting a Web Page Title and URL

require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])

driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get('http://www.bbc.co.uk/')

puts driver.title

puts driver.current_url

driver.quit

The first example script demonstrates the fundamentals of web scraping: navigating to a website and extracting data from its HTML. It uses the #title and #current_url methods to return the page title ‘BBC - Home’ and the URL ‘https://www.bbc.co.uk/’.

Copy and paste the code, then try changing the URL to a different website to see its title and URL.
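
If an error is raised before driver.quit runs, the headless Chrome process can be left running in the background. A minimal sketch of one way to guard against that follows. It assumes two hypothetical helpers, headless_chrome and page_info, and the driver setup is the same as in the script above.

```ruby
# Hypothetical helper: build the same headless Chrome driver used above.
# The gem is required inside the method so the file loads without a browser.
def headless_chrome
  require 'selenium-webdriver'
  options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
  Selenium::WebDriver.for(:chrome, options: options)
end

# Hypothetical helper: visit `url` and return its title and final URL.
# The ensure clause closes the browser even if loading the page raises.
def page_info(driver, url)
  driver.get(url)
  { title: driver.title, url: driver.current_url }
ensure
  driver.quit
end

# Usage:
#   info = page_info(headless_chrome, 'http://www.bbc.co.uk/')
#   puts info[:title], info[:url]
```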

Extracting Data

This next example scrapes the title of the most recent post on the Guardian Football League blog.

require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])

driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get('https://www.theguardian.com/football/football-league-blog')

element = driver.find_element(:class, 'js-headline-text')

puts element.text.strip

driver.quit

The #find_element method selects a single element with the given class. In the example above, this is written as:

element = driver.find_element(:class, 'js-headline-text')

The #text method returns the text content inside the element, in this case the title of the latest blog entry.

But what if we wanted to return all the blog entries on the page with the class name ‘js-headline-text’?

require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])

driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get('https://www.theguardian.com/football/football-league-blog')

element = driver.find_elements(:class, 'js-headline-text')

element.each do |articles|
  puts articles.text.strip
end

driver.quit

The #find_elements method selects all elements on the page with the given class and returns them as an array. In the example above, this is written as:

element = driver.find_elements(:class, 'js-headline-text')

Next, we use the each iterator to loop through the returned elements and print the text of every element with the class name ‘js-headline-text’:

element.each do |articles|
  puts articles.text.strip
end
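
Since #find_elements returns an ordinary Ruby array, the usual Enumerable methods apply to it as well. As a sketch, the hypothetical helper headline_texts below collects the text into an array of strings rather than printing it, and drops entries that are blank after stripping.

```ruby
# Hypothetical helper: return the stripped text of every element with the
# given class, skipping elements whose text is empty.
# `driver` is a Selenium::WebDriver driver that has already loaded a page.
def headline_texts(driver, class_name)
  driver.find_elements(:class, class_name)
        .map { |element| element.text.strip }
        .reject(&:empty?)
end

# Usage, with the driver and page from the example above:
#   headline_texts(driver, 'js-headline-text').each { |title| puts title }
```

Returning the strings instead of printing them makes it easier to sort, filter or save the results later.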

Filling In a Form

The next example shows how to fill out a search box and click a link. This script searches the imdb.com website for the film The Shawshank Redemption and returns a list of the film's cast and the characters they play.

require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])

driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get('http://www.imdb.com')

element = driver.find_element(name: 'q')

element.send_keys('shawshank redemption')

element.submit

driver.find_element(link_text: 'The Shawshank Redemption').click

puts driver.title

cast = driver.find_element(:class, 'cast_list')

puts cast.text.strip

driver.quit
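
The results page may not have finished loading when the script looks for the link, in which case #find_element raises an error. Selenium ships its own Selenium::WebDriver::Wait class for this situation. The sketch below is not that class but a hand-rolled, plain-Ruby stand-in that shows the polling idea; the name wait_until and its default values are assumptions.

```ruby
# Hypothetical helper: call the block repeatedly until it returns a truthy
# value, then return that value. Raises if `timeout` seconds pass first.
def wait_until(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Usage with the driver from the example above. #find_elements returns an
# empty array while the link is missing, so the block keeps returning nil:
#   link = wait_until { driver.find_elements(link_text: 'The Shawshank Redemption').first }
#   link.click
```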

More Than One Element

What if we need to scrape more than one kind of element and group the results together?

The next example shows how to handle more than one set of elements and merge them using the #zip method.

The script scrapes the titles and descriptions of the trophies for the PlayStation 4 game The Witcher 3: Wild Hunt - Game Of The Year Edition and merges them into a new array called data_sets.

require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])

driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get('https://www.playstationtrophies.org/game/the-witcher-3-goty/trophies/')

puts driver.title

trophies = driver.find_elements(:class, 'link_ach')
descriptions = driver.find_elements(:class, 'ac3')
data_sets = trophies.zip(descriptions)

data_sets.each do |trophy, description|
  puts trophy.text.strip, description.text.strip
end

driver.quit
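
#zip pairs items by position, so the two lists need to line up. If the second list is shorter, the missing entries come back as nil, and calling #text on nil raises NoMethodError. A small sketch of that behaviour follows, using plain made-up strings in place of scraped page elements and a hypothetical helper, pair_up, that substitutes a placeholder for a missing description.

```ruby
# Hypothetical helper: pair each title with the description at the same
# position, substituting a placeholder when a description is missing.
def pair_up(titles, descriptions, missing: '(no description)')
  titles.zip(descriptions).map do |title, description|
    [title.strip, (description || missing).strip]
  end
end

# Made-up strings stand in for the text scraped from the page.
titles       = ['  First trophy ', 'Second trophy', 'Third trophy']
descriptions = ['Do the first thing', ' Do the second thing ']

pair_up(titles, descriptions).each do |title, description|
  puts "#{title}: #{description}"
end
```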

Try It Yourself

Give it a try. Write your own scripts using Ruby and Selenium WebDriver. Scraping is a great and easy way to learn coding skills.


Russell Morley

Software Developer In Test | Automation Enthusiast