Parsing HTML with Perl and XPath

August 08, 2008

I have had to write an HTML parser a few times already, and I've always used HTML::Parser because it's simple and because I never bothered to do it another way. Ideally, I wanted to do the parsing with XPath, because that way I can find what I need with one or two queries. Unfortunately, XPath only works on well-formed XML documents, so unless the page I was parsing was XHTML, I couldn't do much.

I also wanted to convert the HTML to well-formed XHTML and use XPath on that, but I didn't know an easy way to do that either. Well, now I do: I can use the HTML::TreeBuilder module from the HTML-Tree CPAN distribution. The module itself is easy enough to use for parsing:

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file("test.html");
# Match <img> elements whose src attribute contains "thumbnail"
my @img = $tree->look_down('_tag', 'img',
    sub { ($_[0]->attr("src") // '') =~ m!thumbnail!; }
);
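
To try the look_down approach without a test.html on disk, here is a minimal sketch that parses an inline string instead. The sample markup and the helper name thumbnail_sources are made up for illustration; new_from_content, look_down and attr come from HTML::TreeBuilder and HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for test.html
my $html = <<'HTML';
<html><body>
<img src="/img/thumbnail_1.png">
<img src="/img/full_1.png">
<img alt="no src here">
</body></html>
HTML

# Return the src of every <img> whose src contains "thumbnail"
sub thumbnail_sources {
    my ($markup) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($markup);
    my @img  = $tree->look_down(
        '_tag', 'img',
        # Treat a missing src as an empty string so the match is safe
        sub { ( $_[0]->attr('src') // '' ) =~ m!thumbnail! },
    );
    my @src = map { $_->attr('src') } @img;
    $tree->delete;    # release the tree explicitly
    return @src;
}

print "$_\n" for thumbnail_sources($html);
```

Note that the tag criterion is spelled '_tag', with a leading underscore, because look_down matches on the element's internal attribute names.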

However, as I would rather use XPath, here is how to easily convert the HTML to XML so that it is parseable with XPath:

use HTML::TreeBuilder;
use XML::XPath;
my $xml = HTML::TreeBuilder->new_from_file("test.html")->as_XML();
my $xp = XML::XPath->new( xml => $xml );
# findnodes returns the matching nodes as a list
my @img = $xp->findnodes('//img[contains(@src, "thumbnail")]');
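
As a rough end-to-end sketch of the same idea, the snippet below feeds deliberately sloppy HTML (an unclosed paragraph, unclosed img tags) through as_XML and then queries the result. The sample markup and the helper name xpath_img_sources are assumptions for illustration. It uses findnodes, which returns the matching nodes as a list, and getAttribute to read each src.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::XPath;

# Hypothetical tag soup: not well-formed XML as written
my $html = '<html><body><p>Unclosed paragraph'
         . '<img src="a_thumbnail.jpg"><br><img src="b.jpg">'
         . '</body></html>';

# Convert HTML to XML, run an XPath query for <img> elements,
# and return the src attribute of each match
sub xpath_img_sources {
    my ( $markup, $query ) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($markup);
    my $xml  = $tree->as_XML;    # well-formed serialisation of the tree
    $tree->delete;
    my $xp = XML::XPath->new( xml => $xml );
    return map { $_->getAttribute('src') } $xp->findnodes($query);
}

print "$_\n"
    for xpath_img_sources( $html, '//img[contains(@src, "thumbnail")]' );
```

Using double quotes inside the XPath expression keeps the Perl string single-quoted, which avoids accidental interpolation of @src.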

And as always, use whichever approach is fit for the task at hand.

Comments

Sandia, October 14, 2008 at 4:01 PM
you could use HTML::TreeBuilder::XPath

http://search.cpan.org/dist/HTML-TreeBuilder-XPath/lib/HTML/TreeBuilder/XPath.pm

:D
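
For reference, a minimal sketch of what that suggestion could look like. HTML::TreeBuilder::XPath adds XPath methods such as findnodes directly to the tree, so no intermediate XML string is needed. The sample markup and the helper name thumbnail_srcs are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup
my $html = '<html><body>'
         . '<img src="x_thumbnail.gif"><img src="y.gif">'
         . '</body></html>';

# Query the parsed HTML tree with XPath directly
sub thumbnail_srcs {
    my ($markup) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($markup);
    $tree->eof;    # tell the parser the document is complete
    my @src = map { $_->attr('src') }
        $tree->findnodes('//img[contains(@src, "thumbnail")]');
    $tree->delete;
    return @src;
}

print "$_\n" for thumbnail_srcs($html);
```

The matched nodes here are ordinary HTML::Element objects, so attr, as_text and friends still work on them.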
nileside, September 18, 2009 at 4:36 PM
#!/usr/bin/perl

use strict;
use HTML::Tree;
use LWP::Simple;

open FILE, "<", "tickers.txt" or die $!;
while (<FILE>) {
    my ($line) = $_;
    chomp($line);
    # Convert the line to upper case.
    $line =~ tr/a-z/A-Z/;
    # Each line holds: ticker, buy price, threshold (percent)
    my @arr = split(/,/, $line);
    print "Ticker", ",", "BuyPrice", ",", "Threshold", ",", "CurrPrice"; print "\n";
    print $arr[0], ",", $arr[1], ",", $arr[2], ",";
    my $funky = "http://www.xxxxx.com/News/CompanyBasicQuote.aspx?sskicode=$arr[0]+";
    my $content = get($funky);
    die "Could not fetch $funky\n" unless defined $content;
    my $tree = HTML::Tree->new();
    $tree->parse($content);
    $tree->eof();    # signal the end of the document to the parser
    my ($title) = $tree->look_down('_tag', 'span');
    # Find the <span> whose id marks the current price
    my $curr_price = $tree->look_down(
        sub { $_[0]->tag() eq 'span' and (($_[0]->attr('id') // '') =~ /BasicQuote1_lblNse1/) }
    );
    my $currP = $curr_price->as_text;
    print $curr_price->as_text, "\n";
    # Alert level: buy price reduced by the threshold percentage
    my $rel_value = $arr[1] - $arr[1] * ($arr[2] / 100);
    print "rel_value ", $rel_value, "\n";
    if ($currP < $rel_value) {
        print "value is less than the threshold. Email";
    }
    else {
        print "value is not less than the threshold";
    }
}
close FILE or die $!;

#http://www.velikan.net/sendmail-for-windows-iis/
#sub sendmail()
#{
# print "Content-type: text/html\n\n";
#
#$title='Trading Alerts';
#$to='...email.com';
#$from= '....uremail.com';
#$subject='Trading Alerts';
#
#open(MAIL, "|/usr/sbin/sendmail -t");
#
## Mail Header
#print MAIL "To: $to\n";
#print MAIL "From: $from\n";
#print MAIL "Subject: $subject\n\n";
## Mail Body
#print MAIL "This is a test message from sharekhan alerts customized! You can write your
#mail body text here\n";
#
#close(MAIL);
#
#print "$title
#\n\n\n";
#
## HTML content let use know we sent an email
#print "$title\n";
#print "A message has been sent from $from to $to";
#print "\n\n";
#}
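
The threshold arithmetic in the script above can be checked on its own, without fetching any page. This is a small sketch assuming each line of tickers.txt has the form ticker,buy_price,threshold_percent, as the script's column headers suggest. The helper name alert_level and the sample line are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "ticker,buy_price,threshold_percent" and
# compute the price below which the script would raise an alert.
sub alert_level {
    my ($line) = @_;
    chomp $line;
    my ( $ticker, $buy_price, $threshold_pct ) = split /,/, uc $line;
    # Buy price reduced by the threshold percentage
    my $level = $buy_price - $buy_price * ( $threshold_pct / 100 );
    return ( $ticker, $level );
}

# Made-up sample line: bought at 200 with a 25 percent threshold
my ( $ticker, $level ) = alert_level("abc,200,25\n");
print "$ticker alerts below $level\n";
```

With a buy price of 200 and a threshold of 25 percent, the alert level works out to 200 - 200 * 0.25 = 150.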

Landfill of Wisdom
