DevOps Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 534 posts at DZone. You can read more from them at their website. View Full User Profile

Stripping Out a Non-Breaking Space Character in Ruby

03.29.2013
| 2952 views |
  • submit to reddit

A couple of days ago I was playing with some code to scrape data from a web page and I wanted to skip a row in a table if the row didn’t contain any text.

I initially had the following code to do that:

rows.each do |row|
  next if row.strip.empty?
  # other scraping code
end

Unfortunately that approach broke down fairly quickly because empty rows contained a non breaking spacei.e. ‘ ’.

If we try called strip on a string containing that character we can see that it doesn’t get stripped:

# its hex representation is A0
> "\u00A0".strip
=> " "
> "\u00A0".strip.empty?
=> false

I wanted to see whether I could use gsub to solve the problem so I tried the following code which didn’t help either:

> "\u00A0".gsub(/\s*/, "")
=> " "
> "\u00A0".gsub(/\s*/, "").empty?
=> false

A bit of googling led me to this Stack Overflow post which suggests using the POSIX space character class to match the non breaking space rather than ‘\s’ because that will match more of the different space characters.

e.g.

> "\u00A0".gsub(/[[:space:]]+/, "")
=> ""
> "\u00A0".gsub(/[[:space:]]+/, "").empty?
=> true

So that we don’t end up indiscriminately removing all spaces to avoid problems like this where we mash the two names together…

> "Mark Needham".gsub(/[[:space:]]+/, "")
=> "MarkNeedham"

…the poster suggested the following regex which does the job:

> "\u00A0".gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
=> ""
> ("Mark" + "\u00A0" + "Needham").gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
=> "Mark Needham"
  • \A matches the beginning of the string
  • \z matches the end of the string

So what this bit of code does is match all the spaces that appear at the beginning or end of the string and then replaces them with ”.

Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)