Parsing HTML with Perl and XPath

I have had to write an HTML parser a few times already, and I've always used HTML::Parser, both because it's simple and because I never bothered to do it another way. Ideally, I wanted to do the parsing with XPath, because that way I can find what I need with one or two queries. Unfortunately, XPath only works on well-formed XML documents, so unless the page I was parsing was XHTML, I couldn't do much.

I also wanted to convert the HTML to well-formed XHTML and use XPath on that. Unfortunately, I didn't know an easy way to do that either. Well, now I do: I can use the HTML::TreeBuilder module from the HTML-Tree CPAN distribution. It is easy enough to use the module itself for parsing:

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file("test.html");
# Match <img> elements whose src attribute contains "thumbnail".
my @img = $tree->look_down('_tag', 'img',
    sub { ($_[0]->attr('src') // '') =~ m!thumbnail! });
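
To try this without a test.html on disk, here is a minimal sketch that parses an inline HTML string instead. The sample markup and the variable names are made up for illustration; new_from_content, look_down, attr and delete are regular HTML::TreeBuilder / HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page, standing in for test.html.
my $html = <<'HTML';
<html><body>
<img src="/images/thumbnail_1.png">
<img src="/images/banner.png">
<img src="/images/thumbnail_2.png">
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Keep only <img> elements whose src attribute contains "thumbnail".
my @img = $tree->look_down(
    '_tag', 'img',
    sub { ($_[0]->attr('src') // '') =~ m!thumbnail! },
);

# Collect and print the matching src values.
my @src = map { $_->attr('src') } @img;
print "$_\n" for @src;

# Free the tree explicitly; older HTML-Tree releases need this to release memory.
$tree->delete;
```

The `// ''` guard is there only so that an img element without a src attribute does not trigger an "uninitialized value" warning.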


However, as I would rather use XPath, here is how to easily convert the HTML to XML so that it is parseable with XPath:

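use HTML::TreeBuilder;
use XML::XPath;
my $xml = HTML::TreeBuilder->new_from_file("test.html")->as_XML();
my $xp = XML::XPath->new( xml => $xml );
# findnodes() returns a list of matching nodes in list context.
my @img = $xp->findnodes('//img[contains(@src, "thumbnail")]');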
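
As a rough end-to-end sketch of the same idea, the following parses an inline HTML string (hypothetical sample markup, not from a real page), serializes it with as_XML, and reads the src attribute of each match. It assumes both HTML::TreeBuilder and XML::XPath are installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::XPath;

# Hypothetical sample page with unclosed <img> tags, as found in plain HTML.
my $html = <<'HTML';
<html><body>
<img src="/images/thumbnail_1.png">
<img src="/images/banner.png">
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
my $xml  = $tree->as_XML;   # well-formed XML serialization of the parsed tree
$tree->delete;              # the tree is no longer needed once serialized

my $xp = XML::XPath->new( xml => $xml );

# In list context, findnodes() returns the matching element nodes.
my @img = $xp->findnodes('//img[contains(@src, "thumbnail")]');

# getAttribute() returns the attribute value of an element node.
my @src = map { $_->getAttribute('src') } @img;
print "$_\n" for @src;
```

Using double quotes inside the XPath expression keeps it from clashing with the single-quoted Perl string around it.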


And as always, use whichever approach is fit for the task at hand.

Comments

  1. you could use HTML::TreeBuilder::XPath

    http://search.cpan.org/dist/HTML-TreeBuilder-XPath/lib/HTML/TreeBuilder/XPath.pm

    :D

  2. #!/usr/bin/perl

    use strict;
    use HTML::Tree;
    use LWP::Simple;
    open FILE, "<", "tickers.txt" or die $!;
    while (<FILE>) {
    my($line) = $_;
    chomp($line);
    # Convert the line to upper case.
    $line =~ tr/a-z/A-Z/;
    my @arr=split(/,/, $line);
    print "Ticker" , "," , "BuyPrice" , ",", "Threshold" , "," , "CurrPrice"; print "\n";
    print $arr[0], "," , $arr[1] , ",", $arr[2] , "," ;
    my $funky = "http://www.xxxxx.com/News/CompanyBasicQuote.aspx?sskicode=$arr[0]+";
    my $content = get($funky);
    my $tree = HTML::Tree->new();
    $tree->parse($content);
    my ($title) = $tree->look_down( '_tag' , 'span' );
    my $curr_price = $tree->look_down(
    sub{ $_[0]-> tag() eq 'span' and ($_[0]->attr('id') =~ /BasicQuote1_lblNse1/)}
    );
    my $currP= $curr_price->as_text;
    print $curr_price->as_text , "\n";
    my $rel_value= $arr[1]-$arr[1]*($arr[2]/100);
    print "rel_value " , $rel_value ,"\n";
    if ( $currP < $rel_value )
    { print "value is less than the threshold. Email" }
    else
    { print "value is not less than the threshold" };

    };
    close FILE or die $! ;

    #http://www.velikan.net/sendmail-for-windows-iis/
    #sub sendmail()
    #{
    # print "Content-type: text/html\n\n";
    #
    #$title='Trading Alerts';
    #$to='...email.com';
    #$from= '....uremail.com';
    #$subject='Trading Alerts';
    #
    #open(MAIL, "|/usr/sbin/sendmail -t");
    #
    ## Mail Header
    #print MAIL "To: $to\n";
    #print MAIL "From: $from\n";
    #print MAIL "Subject: $subject\n\n";
    ## Mail Body
    #print MAIL "This is a test message from sharekhan alerts customized! You can write your
    #mail body text here\n";
    #
    #close(MAIL);
    #
    #print "$title
    #\n\n\n";
    #
    ## HTML content let use know we sent an email
    #print "$title\n";
    #print "A message has been sent from $from to $to";
    #print "\n\n";
    #}

