HTML::CruftText - Remove unuseful text from HTML
Version 0.01
SYNOPSIS
Removes junk from HTML page text.
This module uses a regular-expression-based approach to remove cruft
from HTML, i.e. content that is very unlikely to be useful or
interesting.
use HTML::CruftText;
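open (my $MYINPUTFILE, '<', 'input.html') or die "Cannot open input file: $!";  # 'input.html' is a placeholder file name
my @lines = <$MYINPUTFILE>;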
my $de_crufted_lines = HTML::CruftText::clearCruftText( \@lines);
...
DESCRIPTION
This module was developed for the Media Cloud project
(http://mediacloud.org) as the first step in differentiating article
text from ads, navigation, and other boilerplate text. Its approach is
very conservative and almost never removes legitimate article text.
However, it still leaves in a lot of cruft, so many users will want to
do additional processing.
Typically, the clearCruftText method is called with an array reference
containing the lines of an HTML file. Each line is then altered so that
the cruft text is removed. After completion some lines will be entirely
blank, while others will have certain text removed. In a few rare
cases, additional HTML tags are added. The result is NOT GUARANTEED to
be valid, balanced HTML, though some HTML is retained because it is
extremely useful for further processing. Thus some users will want to
run an HTML stripper over the results.
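As an illustration of that follow-up step, here is a minimal sketch of a naive post-processing pass. It assumes the de-crufted result is an array reference of lines, as the synopsis suggests. The helper strip_remaining_html is hypothetical and not part of HTML::CruftText, and the sample input is a stand-in for real clearCruftText output. The regex used here is deliberately simple; a dedicated HTML stripping module is likely a better choice for real pages.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::CruftText): drops lines that are
# blank after processing and strips any remaining tags with a naive regex.
sub strip_remaining_html {
    my ($lines) = @_;    # array reference of lines

    my @text;
    for my $line (@$lines) {
        my $copy = $line;           # work on a copy; leave the caller's data intact
        $copy =~ s/<[^>]*>//g;      # naive: remove anything that looks like a tag
        $copy =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        push @text, $copy if length $copy;
    }
    return \@text;
}

# Stand-in for de-crufted output; real input would come from clearCruftText.
my $de_crufted_lines = [
    "",
    "<p>Article <b>text</b> here.</p>\n",
    "   \n",
    "<div>More text</div>",
];

my $plain = strip_remaining_html($de_crufted_lines);
print "$_\n" for @$plain;
```

The helper copies each line before editing it, so the de-crufted lines stay available for any further processing that relies on the retained HTML.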
The following tactics are used to remove cruft text:
* Nonbody text -- anything outside of the <body></body> tags -- is
removed
* Text within the following tags is removed: