Case Insensitive Hpricot
I recently started dealing with Hpricot… what a mess, even though it is supposed to be the be-all and end-all of Ruby HTML parsers. My main issue is a complete lack of useful documentation. I ended up having to use some_element.methods.inspect to see what the hell my options were with a particular element, which is where I found etag, which was what I needed.
Of course I wouldn’t have needed etag if Hpricot had an option to do case-insensitive searches… when I need to parse a document for the META info, I shouldn’t need to look for ‘meta’, ‘META’, ‘Meta’ and any other flavors that someone might have typed in. I know the ‘spec’ says this is the way it is supposed to work (case sensitive), but an HTML parser in particular needs to live in the real world.
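To put a number on "any other flavors", here is a small plain-Ruby sketch (no Hpricot involved) that enumerates every upper/lower-case spelling of a tag name. The helper name case_variants is made up for this illustration.

```ruby
# Hypothetical helper: list every upper/lower-case spelling of a tag name.
def case_variants(name)
  name.chars.map { |c| [c.downcase, c.upcase].uniq }
      .reduce([""]) { |acc, pair| acc.product(pair).map(&:join) }
end

variants = case_variants("meta")
puts variants.length   # 16
```

That is sixteen spellings for a four-letter tag alone, before attribute names even come into it, which is why normalizing the document once is the more practical route.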
Here is a method you can call like:
doc = normalize_hpricot(Hpricot(my_html))
# Deal with Hpricot case sensitivity: downcase etag and the attribute names,
# leaving attribute values untouched.
def normalize_hpricot(element)
  element.children.each do |child|
    # Lowercase etag where the node has one.
    if child.respond_to?(:etag=)
      child.etag = child.etag.downcase if child.etag
    end
    if child.respond_to?(:raw_attributes=)
      attribs = {}
      begin
        child.raw_attributes.each_pair do |key, value|
          attribs[key.downcase] = value if value
        end
        child.raw_attributes = attribs
      rescue
        # Skip nodes whose raw_attributes cannot be iterated (for example, nil).
      end
    end
    normalize_hpricot(child) if child.respond_to?(:children) && child.children
  end
  element
end

This code was taken from http://davidsmalley.com/2008/4/24/hpricot-case-sensitivity and fixed so that it works and so that it lowercases only the tag names and the attribute names from the HTML tags, not everything.
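If you want to see what the recursion does without installing Hpricot, here is a rough sketch of the same walk over a stand-in tree. Node is a made-up Struct that only mimics the three accessors the method touches (etag, raw_attributes, children); it is not an Hpricot class, and normalize_tree is just an illustrative name.

```ruby
# Hypothetical stand-in for a parsed element; NOT an Hpricot class.
Node = Struct.new(:etag, :raw_attributes, :children)

# Same idea as normalize_hpricot, written against the stand-in nodes.
def normalize_tree(node)
  (node.children || []).each do |child|
    child.etag = child.etag.downcase if child.etag
    if child.raw_attributes
      # Rebuild the hash with lowercased keys; values are kept as typed.
      child.raw_attributes = child.raw_attributes.each_with_object({}) do |(key, value), out|
        out[key.downcase] = value if value
      end
    end
    normalize_tree(child)
  end
  node
end

meta = Node.new("META", { "NAME" => "Description", "Content" => "Some Text" }, nil)
root = Node.new(nil, nil, [Node.new("HEAD", {}, [meta])])
normalize_tree(root)

puts meta.etag                     # meta
puts meta.raw_attributes.keys.join(",")
```

Only the tag and attribute names change case; the values ("Description", "Some Text") come through exactly as they were typed.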
Thanks. This was driving me batty.
I use Hpricot to scrape from a collection of sites that all use the same CMS. You’d think I’d be safe where meta tag capitalization was concerned, seeing as how they all use the same templates. But about one time out of every seven or eight, my script would fall over on one meta tag. It took me a few go-rounds to realize that they’re using load balancers, and that at least one of the load balancers, for whatever reason, has a set of templates that match everything identically except for the capitalization of the meta tag attributes.