I have had to write an HTML parser a few times already, and I’ve always used HTML::Parser, because it’s simple and because I never bothered to do it another way. Ideally, I wanted to do the parsing with XPath, because that way I can find what I need with one or two queries. Unfortunately, XPath only works on well-formed XML documents, so unless the page I was parsing was XHTML, I couldn’t do much.

I also wanted to somehow convert the HTML to well-formed XHTML and use XPath on that. Unfortunately, I didn’t know an easy way to do that either. Well, now I do: I can use the HTML::TreeBuilder module from the HTML-Tree CPAN distribution. The module is easy enough to use for parsing on its own:

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file("test.html");
# Find <img> elements whose src attribute contains "thumbnail"
my @img = $tree->look_down('_tag', 'img',
    sub { ($_[0]->attr('src') // '') =~ m!thumbnail! });

However, as I would rather use XPath, here is how to easily convert the HTML to XML so that it can be queried with XPath:

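use HTML::TreeBuilder;
use XML::XPath;
my $xml = HTML::TreeBuilder->new_from_file("test.html")->as_XML();
my $xp = XML::XPath->new( xml => $xml );
# findnodes returns the matching nodes as a list in list context
my @img = $xp->findnodes('//img[contains(@src, "thumbnail")]');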
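
Put together as a complete script, the conversion might look like the sketch below. The inline HTML string and the image paths are made up for illustration, and it reads from a string with new_from_content instead of from test.html so that it runs on its own. It assumes both HTML::TreeBuilder and XML::XPath are installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::XPath;

# Hypothetical, deliberately sloppy HTML: unclosed <p> and <img> tags
# and an unquoted attribute value, i.e. not valid XML as it stands.
my $html = <<'HTML';
<html><body>
<p>Gallery
<img src="/images/thumbnail_1.png" alt=first>
<img src="/images/full_1.png">
<p><img src="/images/thumbnail_2.png">
</body></html>
HTML

# Parse the tag soup and serialize the resulting tree as XML.
my $tree = HTML::TreeBuilder->new_from_content($html);
my $xml  = $tree->as_XML();
$tree->delete;    # release the tree once it is no longer needed

# Query the XML with XPath and collect the src attribute of each match.
my $xp   = XML::XPath->new( xml => $xml );
my @srcs = map { $_->getAttribute('src') }
           $xp->findnodes('//img[contains(@src, "thumbnail")]');

print "$_\n" for @srcs;
```

Each item returned by findnodes is an element node, so getAttribute can be used to read the attributes I am after.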

And as always, use whichever approach fits the task at hand.