HTML::CruftText - Remove unuseful text from HTML

Version 0.01

SYNOPSIS
    Removes junk from HTML page text.

    This module uses a regular-expression-based approach to remove cruft
    from HTML, i.e. content or text that is very unlikely to be useful or
    interesting.

        use HTML::CruftText;

        # 'input.html' is a placeholder file name
        open (my $MYINPUTFILE, '<', 'input.html') or die "Cannot open file: $!";
        my @lines = <$MYINPUTFILE>;

        my $de_crufted_lines = HTML::CruftText::clearCruftText( \@lines);
        ...

DESCRIPTION
    This module was developed for the Media Cloud project
    (http://mediacloud.org) as the first step in differentiating article
    text from ads, navigation, and other boilerplate text. Its approach is
    very conservative and almost never removes legitimate article text.
    However, it still leaves in a lot of cruft, so many users will want to
    do additional processing.

    Typically, the clearCruftText function is called with an array
    reference containing the lines of an HTML file. Each line is then
    altered so that the cruft text is removed. After completion, some lines
    will be entirely blank, while others will have only certain text
    removed. In a few rare cases, additional HTML tags are added. The
    result is NOT GUARANTEED to be valid, balanced HTML, though some HTML
    is retained because it is extremely useful for further processing.
    Thus some users will want to run an HTML stripper over the results.

    The following tactics are used to remove cruft text:

    * Nonbody text -- anything outside of the <body></body> tags -- is
      removed

    * Text within the following tags is removed: