
Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j, building sophisticated solutions to challenging data problems. When he's not with customers, Mark is a developer on Neo4j and writes about his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB, is not an employee of DZone, and has posted 543 posts at DZone.

When Nokogiri fails: Web Driver to the rescue

03.26.2013

As I mentioned in my previous post, I wanted to add televised games to my football graph, and the Premier League website seemed like the best place to find out which games those were.

I initially tried to use Nokogiri to grab the data that I wanted…

> require 'nokogiri'
> require 'open-uri'
> tv_times = Nokogiri::HTML(open('http://www.premierleague.com/en-gb/matchday/broadcast-schedules.tv.html?rangeType=.dateSeason&country=GB&clubId=ALL&season=2012-2013&isLive=true'))

…but when I tried to query by CSS selector for all the matches nothing came back:

> tv_times.css(".broadcastschedule table.contentTable tbody tr")
=> []

I was a bit surprised but read somewhere that I should check if there were any errors while parsing the document. In fact there were quite a few!

> tv_times.errors
=> [#<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, #<Nokogiri::XML::SyntaxError: Element script embeds close tag>, ...]

I ran the document through the W3C markup validation service and it didn’t seem to find any problem with it.

Next I tried stripping out all the script tags using Loofah, and then removing them manually, but neither approach helped.
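For illustration, manual removal could look something like the sketch below. It assumes a plain regular expression is good enough for simple script blocks, and `strip_script_tags` is a hypothetical helper name rather than the code that was actually run against the Premier League page.

```ruby
# Hypothetical helper: drop <script>...</script> blocks with a regex
# before handing the markup to a parser. An illustrative sketch only.
def strip_script_tags(html)
  # m: let . match newlines, i: ignore case,
  # .*? stops at the first closing script tag
  html.gsub(%r{<script\b[^>]*>.*?</script\s*>}mi, "")
end

# Made-up sample markup, not taken from the real page
sample = <<~HTML
  <html><body>
  <script type="text/javascript">
    var a = 1 < 2;
  </script>
  <table class="contentTable"><tbody><tr><td>Match</td></tr></tbody></table>
  </body></html>
HTML

cleaned = strip_script_tags(sample)
puts cleaned
```

A regex like this is fragile on real-world pages (for example when a script body itself contains a closing script tag inside a string), which may be part of why this route is not reliable.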

I’ve previously used Web Driver to scrape web pages, but I’d found that Nokogiri was much faster, so I stopped using Web Driver.

Since my new library wasn’t playing ball I thought I’d quickly see if Web Driver was up to the challenge and indeed it was:

require "selenium-webdriver"
 
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "http://www.premierleague.com/en-gb/matchday/broadcast-schedules.tv.html?rangeType=.dateSeason&country=GB&clubId=ALL&season=2012-2013&isLive=true"
 
matches = driver.find_elements(:css, '.broadcastschedule table.contentTable tbody tr')
matches.each do |tr|
  match = tr.find_element(:css, "td.show a").text
  broadcaster = tr.find_element(:css, "td.broadcaster img").attribute("src")
  tv_channel = broadcaster.include?("sky-sports") ? "Sky" : "ESPN"
 
  puts "#{match},#{tv_channel}"
end
 
driver.quit


$ ruby tv_games.rb 
Newcastle United vs Tottenham Hotspur,ESPN
Wigan Athletic vs Chelsea,Sky
Manchester City vs Southampton,Sky
Everton vs Manchester United,Sky
Swansea City vs West Ham United,Sky
Chelsea vs Newcastle United,ESPN
...
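
Since the end goal is to add these games to the football graph, the output lines above could be split into structured records along these lines. This is only a sketch: `TelevisedGame` and `parse_line` are hypothetical names, and treating the first team as the home side is an assumption about how the fixtures are listed.

```ruby
# Hypothetical record type for one televised game
TelevisedGame = Struct.new(:home, :away, :channel)

# Split a line such as "Wigan Athletic vs Chelsea,Sky" into its parts.
# Assumption: the team named first is the home side.
def parse_line(line)
  # rpartition splits on the last comma, so the channel is always the tail
  fixture, _, channel = line.strip.rpartition(",")
  home, away = fixture.split(" vs ", 2)
  TelevisedGame.new(home, away, channel)
end

# Two lines copied from the scraper output above
lines = [
  "Newcastle United vs Tottenham Hotspur,ESPN",
  "Wigan Athletic vs Chelsea,Sky"
]

games = lines.map { |line| parse_line(line) }
games.each { |g| puts "#{g.home} | #{g.away} | #{g.channel}" }
```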

Ideally I’d like to use Nokogiri for this job, but it has decided that the document is invalid and can’t parse it properly, so Web Driver is a pretty decent replacement I reckon!



Published at DZone with permission of Mark Needham, author and DZone MVB. (source)
