What Is Web Scraping?
Web scraping is a technique for extracting data from websites, typically ones that do not offer an API. A scraper collects the data automatically and presents it in a format that is easy to work with.
This tutorial will show you how to scrape websites with Ruby and Headless Chrome, using Selenium WebDriver.
Prerequisites
Selenium WebDriver
Open the command prompt on your operating system and execute the command below.
gem install selenium-webdriver
(If you already have the gem installed make sure you have version 3.4.1 or higher.)
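Running gem list selenium-webdriver prints the installed version. If you would rather compare versions from Ruby, a minimal sketch follows. The helper name meets_minimum? is made up for this illustration; Gem::Version is part of RubyGems, which ships with Ruby.

```ruby
# Hypothetical helper: compares version strings segment by segment,
# so '3.10.0' correctly counts as newer than '3.4.1'.
def meets_minimum?(installed, minimum = '3.4.1')
  Gem::Version.new(installed) >= Gem::Version.new(minimum)
end

puts meets_minimum?('3.4.1')   # true
puts meets_minimum?('3.10.0')  # true
puts meets_minimum?('3.4.0')   # false
```

A plain string comparison would get '3.10.0' wrong, which is why the sketch goes through Gem::Version.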
ChromeDriver
Download the latest version of ChromeDriver.
Unzip ChromeDriver and add the executable to /usr/local/bin.
Or, if you use Homebrew:
brew install chromedriver
Below is a list of finders used with Ruby and Selenium WebDriver. For the examples, I will mostly be using :class, but any of the finders can be used if the matching attribute is present in the HTML of the website you are scraping.
FINDERS = {
  :class => 'ClassName',
  :class_name => 'ClassName',
  :css => 'CssSelector',
  :id => 'Id',
  :link => 'LinkText',
  :link_text => 'LinkText',
  :name => 'Name',
  :partial_link_text => 'PartialLinkText',
  :tag_name => 'TagName',
  :xpath => 'Xpath'
}
Extracting a Web Page Title and URL
require 'selenium-webdriver'
options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
driver = Selenium::WebDriver.for(:chrome, options: options)
driver.get('http://www.bbc.co.uk/')
puts driver.title
puts driver.current_url
driver.quit
The first example script demonstrates the fundamentals of web scraping: navigating to a website and extracting data from the page. It uses the #title and #current_url methods to return the page title, ‘BBC - Home’, and the URL, ‘https://www.bbc.co.uk/’.
Copy and paste the code, then try changing the URL to a different website to see other titles and URLs.
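If the script raises an error before it reaches driver.quit, the headless Chrome process can be left running. One possible way around that is to wrap the steps in a method with an ensure clause. This is only a sketch: page_summary is a hypothetical helper name, and the driver argument is assumed to be a Selenium driver created as in the script above.

```ruby
# Hypothetical helper: visits a URL and returns the page title and final URL.
# `driver` is assumed to be a Selenium driver; anything that responds to
# #get, #title, #current_url and #quit will work.
def page_summary(driver, url)
  driver.get(url)
  { title: driver.title, url: driver.current_url }
ensure
  # Runs even if #get or #title raises, so the browser is always closed.
  driver.quit
end
```

With a real driver, the call would look something like page_summary(Selenium::WebDriver.for(:chrome, options: options), 'http://www.bbc.co.uk/'), which returns a hash with :title and :url keys.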
Extracting Data
This next example scrapes the title of the most recent post on the Guardian Football League blog.
require 'selenium-webdriver'
options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
driver = Selenium::WebDriver.for(:chrome, options: options)
driver.get('https://www.theguardian.com/football/football-league-blog')
element = driver.find_element(:class, 'js-headline-text')
puts element.text.strip
driver.quit
The #find_element method selects a single element with the given class. In the example above this is written as:
element = driver.find_element(:class, 'js-headline-text')
The #text method returns the text content inside the element, in this case the title of the latest blog entry.
But what if we wanted to return all blog entries on the page with the class name ‘js-headline-text’?
require 'selenium-webdriver'
options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
driver = Selenium::WebDriver.for(:chrome, options: options)
driver.get('https://www.theguardian.com/football/football-league-blog')
elements = driver.find_elements(:class, 'js-headline-text')
elements.each do |article|
  puts article.text.strip
end
driver.quit
The #find_elements method selects all elements on the page with the given class. In the example above this is written as:
elements = driver.find_elements(:class, 'js-headline-text')
Next we use the each iterator to loop through the returned elements and print the text of each one.
elements.each do |article|
  puts article.text.strip
end
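The loop above prints each headline as it goes. If you want to keep the headlines for later use, one option is to collect them into an array instead. The sketch below uses a hypothetical helper, headline_texts, which is not part of Selenium; it only assumes each element responds to #text, as Selenium elements do.

```ruby
# Hypothetical helper: turns a list of elements into clean strings.
# Each element only needs to respond to #text.
def headline_texts(elements)
  elements.map { |element| element.text.strip }
          .reject(&:empty?) # drop elements that returned no text
end
```

With a real driver this could be called as headline_texts(driver.find_elements(:class, 'js-headline-text')). The reject step is a guess at a sensible default, since some matched elements may return an empty string.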
Filling in a Form
The next example shows how to fill out a search box and click a link. This script searches the imdb.com website for the film The Shawshank Redemption and returns a list of the film's cast and the characters they play.
require 'selenium-webdriver'
options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
driver = Selenium::WebDriver.for(:chrome, options: options)
driver.get('http://www.imdb.com')
element = driver.find_element(name: 'q')
element.send_keys('shawshank redemption')
element.submit
driver.find_element(link_text: 'The Shawshank Redemption').click
puts driver.title
cast = driver.find_element(:class, 'cast_list')
puts cast.text.strip
driver.quit
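The script prints the cast list as one block of text. To work with actors and characters separately, the text can be split line by line. The sketch below rests on an assumption: that each line of the scraped text has the form "Actor Name ... Character Name". Check the real output before relying on it. parse_cast is a hypothetical helper name, and the sample string stands in for cast.text.

```ruby
# Assumption: each cast line looks like "Actor Name ... Character Name".
# Lines without the " ... " separator (headings, blank lines) are skipped.
def parse_cast(text)
  rows = text.each_line.map do |line|
    actor, character = line.strip.split(' ... ', 2)
    { actor: actor, character: character } if character
  end
  rows.compact
end

# Sample input in the assumed format, standing in for cast.text.
sample = "Tim Robbins ... Andy Dufresne\nMorgan Freeman ... Ellis Boyd 'Red' Redding\n"
parse_cast(sample).each do |row|
  puts "#{row[:actor]} plays #{row[:character]}"
end
```

If the site formats its cast list differently, only the separator passed to split should need to change.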
More Than One Element
What if we needed to scrape more than one element and also group them together?
The next example shows how to handle more than one set of elements and merge them using the #zip method.
The script scrapes the title and description of each trophy for the PlayStation 4 game The Witcher 3: Wild Hunt - Game Of The Year Edition and pairs them up in a new array called data_sets.
require 'selenium-webdriver'
options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
driver = Selenium::WebDriver.for(:chrome, options: options)
driver.get('https://www.playstationtrophies.org/game/the-witcher-3-goty/trophies/')
puts driver.title
trophies = driver.find_elements(:class, 'link_ach')
descriptions = driver.find_elements(:class, 'ac3')
data_sets = trophies.zip(descriptions)
data_sets.each do |trophy, description|
  puts trophy.text.strip, description.text.strip
end
driver.quit
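One thing to watch with #zip: it pairs items by position, and if the second array is shorter it fills the gaps with nil. In the script above, that would make description.text raise a NoMethodError. The plain-Ruby sketch below shows the behaviour with made-up placeholder strings in place of scraped elements.

```ruby
# Made-up placeholder data standing in for the scraped element texts.
titles = ['Title A', 'Title B', 'Title C']
descriptions = ['Description A', 'Description B']

# #zip pairs items by position and pads with nil when the argument is shorter.
pairs = titles.zip(descriptions)
p pairs
# [["Title A", "Description A"], ["Title B", "Description B"], ["Title C", nil]]

# Guard against the nil before using the description.
pairs.each do |title, description|
  puts "#{title}: #{description || '(no description)'}"
end
```

The result always has as many pairs as the array #zip is called on, so comparing trophies.length with descriptions.length before zipping is a cheap sanity check.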
Try It Yourself
Give it a try. Write your own scripts using Ruby and Selenium WebDriver. Scraping is a great and easy way to learn coding skills.