NAME
    nsitemap - functions for generating a site map for a given site URL

SYNOPSIS
        use nsitemap;
        use LWP::UserAgent;

        my $ua = LWP::UserAgent->new;
        my $sitemap = nsitemap->new(
            EMAIL       => 'your@email.address',
            USERAGENT   => $ua,
            ROOT        => 'http://your.ip.address/'
        );

        $sitemap->generate();
        $sitemap->option( 'VERBOSE' => 1 );
        my $len = $sitemap->option( 'SUMMARY_LENGTH' );

        my $root = $sitemap->root();
        for my $url ( $sitemap->urls() )
        {
            if ( $sitemap->is_internal_url( $url ) )
            {
                # do something ...
            }
            my @links = $sitemap->links( $url );
            my $title = $sitemap->title( $url );
            my $summary = $sitemap->summary( $url );
            my $depth = $sitemap->depth( $url );
            my $digest = $sitemap->MD5digest( $url );
        }
        $sitemap->traverse(
            sub {
                my ( $sitemap, $url, $depth, $flag ) = @_;
                if ( $flag == 0 )
                {
                    # do something at the start of a list of sub-pages ...
                }
                elsif ( $flag == 1 )
                {
                    # do something for each page ...
                }
                elsif ( $flag == 2 )
                {
                    # do something at the end of a list of sub-pages ...
                }
            }
        );

DESCRIPTION
    The `nsitemap' module creates a site map for a WWW site by traversing
    the site using the WWW::Robot module. The nsitemap object provides
    methods to access a list of all the URLs in the site; a list of all the
    links for each URL; page titles; page summaries; page fingerprints
    (MD5digest); and the depth, or minimum number of links from the root URL
    to a page.

CONSTRUCTOR
  nsitemap->new [ $option => $value ] ...
        my $sitemap = nsitemap->new(
            EMAIL       => 'your@email.address',
            USERAGENT   => LWP::UserAgent->new,
            ROOT        => 'http://www.my.com/'
        );

    Possible options are:

    EMAIL
        The email address the robot uses to identify itself. This option is
        required.

    ROOT
        Root URL of the site for which the site map is being created. This
        option is required.

    USERAGENT
        User agent (typically 'LWP::UserAgent->new') used by the robot. This
        option is required.

    VERBOSE
        Verbose flag, for printing out useful messages during traversal
        [0 or 1]. Defaults to 0.

    SUMMARY_LENGTH
        Maximum length of (automatically generated) summary.
        Defaults to 200.

    DEPTH
        Maximum depth of traversal. Defaults to no limit.

METHODS
  generate( )
    Generates the site map, based on the constructor options.

        $site->generate();

  option( $option [=> $value ] )
    Interface to get / set options after object construction.

        $site->option( 'VERBOSE' => 1 );
        my $len = $site->option( 'SUMMARY_LENGTH' );

  root( )
    Returns the root URL for the site.

        my $root = $site->root();

  urls( )
    Returns a list of all the URLs on the site map.

        my @urls = $site->urls();

  is_internal_url( $url )
    Returns 1 (one) if $url is an internal URL based on the ROOT value.
    Otherwise returns 0 (zero).

        if ( $site->is_internal_url( $url ) )
        {
            # do something ...
        }

  links( $url )
    Returns a list of all the links from a given URL in the site map.

        my @links = $site->links( $url );

  title( $url )
    Returns the title of the URL, based on the TITLE tag.

        my $title = $site->title( $url );

  MD5digest( $url )
    Returns the MD5_hex (fingerprint) of the URL.

        my $fingerprint = $site->MD5digest( $url );

  summary( $url )
    Returns a summary of the URL, generated using HTML::Summary. If the URL
    has a NAME='description' META tag, returns the value of CONTENT.
    Otherwise it attempts to summarize the text.

        my $summary = $site->summary( $url );

  depth( $url )
    Returns the minimum number of links to traverse from the root URL of the
    site to this URL. The root URL is at depth zero.

        my $depth = $site->depth( $url );

  traverse( \&callback )
    The traverse method walks the site map, starting at the root node
    (specified by the ROOT option), and visits each URL in the order in
    which it would be displayed in a sequential site map of the site. The
    callback receives the arguments ( $sitemap, $url, $depth, $flag ) and is
    called at several points in the traversal, as indicated by the $flag
    argument:

    $flag = 0
        Triggered before each set of daughter URLs of a given URL.

    $flag = 1
        Triggered for each URL.

    $flag = 2
        Triggered after each set of daughter URLs of a given URL.

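    The flag protocol above lends itself to building an indented list. The
    following is a minimal sketch, not part of nsitemap: make_list_callback
    and walk are hypothetical helpers, and walk is only a stand-in that
    imitates the callback protocol over a hard-coded link table so that the
    example runs without network access. It assumes flag 1 fires for a page
    before the flag 0 / flag 2 pair that brackets its sub-pages; this
    documentation does not state the exact ordering.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a callback with the documented
# ( $sitemap, $url, $depth, $flag ) signature. It appends one line of
# indented list markup per call to the array referenced by $out.
sub make_list_callback {
    my ($out) = @_;
    return sub {
        my ( $sitemap, $url, $depth, $flag ) = @_;
        my $pad = '  ' x $depth;
        if ( $flag == 0 ) {       # start of a list of sub-pages
            push @$out, "$pad<ul>";
        }
        elsif ( $flag == 1 ) {    # one page
            push @$out, "$pad<li>$url</li>";
        }
        elsif ( $flag == 2 ) {    # end of a list of sub-pages
            push @$out, "$pad</ul>";
        }
    };
}

# Stand-in for traverse() (an assumption, NOT the module's code):
# visits each URL once, depth first, following a hash of
# url => [ linked urls ]. No nsitemap object exists here, so the
# callback's first argument is undef.
sub walk {
    my ( $links, $callback, $url, $depth, $seen ) = @_;
    $depth ||= 0;
    $seen  ||= {};
    $seen->{$url} = 1;
    $callback->( undef, $url, $depth, 1 );
    my @kids = grep { !$seen->{$_}++ } @{ $links->{$url} || [] };
    return unless @kids;
    $callback->( undef, $url, $depth, 0 );
    walk( $links, $callback, $_, $depth + 1, $seen ) for @kids;
    $callback->( undef, $url, $depth, 2 );
}

# Made-up link table for illustration only.
my %links = (
    '/'       => [ '/a.html', '/b.html' ],
    '/a.html' => [ '/', '/c.html' ],
);

my @html;
walk( \%links, make_list_callback( \@html ), '/' );
print "$_\n" for @html;
```

    With a generated site map, the same callback could be passed directly,
    as in $sitemap->traverse( make_list_callback( \@html ) ). The markup
    here is deliberately simple; strictly valid HTML would nest each <ul>
    inside its parent <li>.
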
SEE ALSO
    LWP::UserAgent
    HTML::Summary
    WWW::Robot

AUTHOR
    Steve Horsburgh

CREDITS
    This utility was inspired by the 1997 Sitemap.pm utility by Ave Wrigley.

COPYRIGHT
    Copyright (c) 2000, Horsburgh.com. All rights reserved. This script is
    free software; you can redistribute it and/or modify it under the GNU
    GPL. (See the file COPYING.)