Case Insensitive Hpricot

I recently started dealing with Hpricot, and what a mess, even though it is supposed to be the be-all and end-all of Ruby HTML parsers. My main issue is the complete lack of useful documentation. I ended up having to call some_element.methods.inspect to see what the hell my options were for a particular element, which is how I found etag, the thing I needed.
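For what it's worth, the raw output of methods.inspect is mostly noise, since it includes everything every Ruby object has. A rough sketch of a quieter way to poke around is below. The helper names interesting_methods and setter_methods are made up for illustration, and Sample is just a stand-in Struct so the snippet runs without Hpricot installed; with Hpricot you would pass in a real element instead.

```ruby
# Drop the methods every Ruby object has, so only the
# class-specific ones are left to read through.
def interesting_methods(obj)
  (obj.methods - Object.instance_methods).sort
end

# Setters (names ending in "=") hint at what can be changed in place.
def setter_methods(obj)
  interesting_methods(obj).select { |name| name.to_s.end_with?("=") }
end

# Stand-in object for illustration only (not an Hpricot class).
Sample = Struct.new(:etag, :raw_attributes)
sample = Sample.new("META", {})

puts setter_methods(sample).inspect
```

Filtering for setters is how something like etag= stands out quickly instead of being buried in a few hundred method names.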

Of course, I wouldn’t have needed etag if Hpricot had an option for case-insensitive searches. When I need to parse a document for its META info, I shouldn’t have to look for ‘meta’, ‘META’, ‘Meta’ and every other flavor someone might have typed in. I know the ‘spec’ says this is how it is supposed to work (case sensitive), but an HTML parser in particular needs to live in the real world.

Here is a method you can call like this:

doc = normalize_hpricot(Hpricot(my_html))

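# Deal with Hpricot's case sensitivity by lowercasing tag and attribute names.
def normalize_hpricot(element)
  element.children.each do |child|
    # Lowercase the tag name, where the node has one.
    if child.respond_to?(:etag=)
      child.etag = child.etag.downcase if child.etag
    end

    # Lowercase the attribute names; the values are left untouched.
    if child.respond_to?(:raw_attributes=)
      attribs = {}
      begin
        child.raw_attributes.each_pair do |key, value|
          attribs[key.downcase] = value if value
        end
        child.raw_attributes = attribs
      rescue
        # Skip nodes whose attributes are missing or can't be rewritten.
      end
    end

    # Recurse into nested elements.
    normalize_hpricot(child) if child.respond_to?(:children) and child.children
  end
  return element
end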

This code was taken from http://davidsmalley.com/2008/4/24/hpricot-case-sensitivity and fixed so that it actually works and no longer lowercases everything, only the tag names and the attribute names of the HTML tags.
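If you want to see what the walk does without having Hpricot around, here is a small sketch of the same recursion run against stand-in nodes. FakeElem, FakeText and normalize_tree are hypothetical names invented for this example and are not part of Hpricot; the only assumption is that a node exposes etag, raw_attributes and children the way the method above expects.

```ruby
# Hypothetical stand-ins for parsed nodes (NOT Hpricot classes).
FakeElem = Struct.new(:etag, :raw_attributes, :children)
FakeText = Struct.new(:content)

# Same idea as normalize_hpricot: lowercase tag and attribute names only.
def normalize_tree(element)
  element.children.each do |child|
    if child.respond_to?(:etag=) && child.etag
      child.etag = child.etag.downcase
    end

    if child.respond_to?(:raw_attributes=) && child.raw_attributes
      attribs = {}
      child.raw_attributes.each_pair do |key, value|
        attribs[key.downcase] = value if value
      end
      child.raw_attributes = attribs
    end

    # Text nodes have no children, so they are skipped here.
    normalize_tree(child) if child.respond_to?(:children) && child.children
  end
  element
end

meta = FakeElem.new("META", { "NAME" => "Description", "Content" => "Some Page" }, [])
text = FakeText.new("Mixed Case Text")
head = FakeElem.new("HEAD", {}, [meta, text])
root = FakeElem.new(nil, nil, [head])

normalize_tree(root)

puts meta.etag
puts meta.raw_attributes.inspect
puts text.content
```

The attribute values and the text content keep their original case, which is the difference from lowercasing the whole document before parsing it.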

This entry was posted on Sun, 05 Apr 2009 13:19:00 GMT.


Comments

  1. mph about 1 month later:

    Thanks. This was driving me batty.

    I use Hpricot to scrape from a collection of sites that all use the same CMS. You’d think I’d be safe where meta tag capitalization was concerned, seeing as how they all use the same templates. But about one time out of every seven or eight, my script would fall over on one meta tag. It took me a few go-rounds to realize that they’re using load balancers, and that at least one of the load balancers, for whatever reason, has a set of templates that match everything identically except for the capitalization of the meta tag attributes.
