NAME
       crawler - recursively explore the Internet and copy documents locally

SYNOPSIS
       crawler [options] -- [url] ...

DESCRIPTION
       crawler maintains an up-to-date copy of a given set of URLs on the
       local machine.  It first loads the specified URLs (the url arguments)
       locally, then analyses their contents to find other URLs starting at
       the same base.  It keeps loading the URLs it finds until there are
       none left or until it reaches the depth limit (see the -depth
       option).

       Note that the -- must be specified to mark the end of the options,
       otherwise the URL arguments will be mistaken for options.

DATABASE OPTIONS
       -base <base name>
              Name of the MySQL database to use for the meta information.

       -user <name>
              Name of the user to connect to the database.

       -password <password>
              Password of the -user to connect to the database.

       -port <port>
              TCP/IP port to connect to the database, if not the default.

       -socket <file>
              Full path name of the Unix socket file for local database
              connections.

       -host <hostname>
              Host name of the database server.

       -net_buffer <size>
              Size of the network buffer for communications with the
              database (default is 1 MB).

       -create
              Create all the tables.  Exclusive; no other option is
              accepted.

CRAWL OPTIONS
       -update
              Force updating even if the status of the Home Page says that
              it should resume after an interruption.

       -where_start <where clause>
              Only consider those Home Pages that match the <where clause>
              restriction.

       -rehook
              Check and fix the consistency between the meta information
              database and the fulltext index.

       -rebuild
              Remove all the records from the full text database and
              resubmit all the URLs for indexing.  If the environment
              variable ECILA_NO_CRAWL is set, it will not crawl, just
              reindex the URLs for which a file is present in the crawler
              cache (a sketch appears at the end of this page).

       -rebuild_resume
              Resume from an interrupted -rebuild.

       -no_hook
              Do not notify the fulltext engine when a document is inserted
              or removed.

       -timeout <sec>
              Set the TCP/IP timeout delay to <sec> seconds.

       -loaded_delay <days>
              Do not try to load URLs that were loaded successfully less
              than <days> days ago.  Defaults to 7 days.

       -modified_delay <days>
              Do not try to load URLs reported not modified less than
              <days> days ago.  Defaults to 30 days.

       -not_found_delay <days>
              Do not try to load URLs reported not found less than <days>
              days ago.  Defaults to 60 days.

       -timeout_delay <days>
              Do not try to load URLs that failed less than <days> days ago
              with a timeout (a real timeout or a connection failure; in
              short, any error that is likely to disappear within a short
              delay).  Defaults to 3 days.

       -robot_delay <sec>
              Number of seconds between requests to the same server when
              robot exclusion compliant (default 60).

       -accept <mime types>
              Only accept the specified MIME types.  A comma separated list
              of MIME type specifications is allowed.  The * may be used to
              specify any type or subtype.

       -size_limit <limit in bytes>
              Maximum size of a URL that can be loaded.

       -noheuristics
              Boolean option that deactivates the heuristics preventing
              unnecessary network accesses.  An example of such a heuristic
              is to rely on the local image of a URL if this URL was loaded
              successfully from the WWW less than seven days ago.  Such a
              behaviour is very useful most of the time, but can be
              inhibited via the -noheuristics flag if it becomes
              undesirable.  The -loaded_delay, -modified_delay and
              -not_found_delay options are parameters for the functions
              that implement the heuristics and are ignored if the
              -noheuristics flag is set.
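       As an illustration of how the database and crawl options combine, a
       periodic update pass over a single server might look like the
       following sketch (the database name, credentials and URL are
       placeholders, not defaults):

              crawler -base webbase -user wb -password secret \
                      -sleepy -robot_delay 30 -depth 500 \
                      -- http://www.example.org/

       Here -sleepy is appropriate because a single server is being loaded,
       and -robot_delay lowers the per-server politeness delay from its
       default of 60 seconds.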
       -norobot_exclusion
              Boolean option that controls the application of the proposed
              robot exclusion protocol.  If -norobot_exclusion is given,
              the robot exclusion protocol is not used.  This option should
              be used only when synchronizing a few URLs by hand or when
              calling crawler from an interactive application.

       -allow <list>
              In the fashion of robots.txt, allow a list of prefixes for
              exploration.  It comes in addition to the robots.txt
              information.  <list> is a white space separated list of
              prefixes such as "/dir /dir/~".

       -disallow <list>
              In the fashion of robots.txt, disallow a list of prefixes;
              the robot will never visit them.  It comes in addition to the
              robots.txt information.  <list> is a white space separated
              list of prefixes such as "/dir /dir/~".

       -agent <name>
              Set the User-Agent parameter for HTTP communication to <name>
              instead of the default value.

       -sleepy
              Sleep immediately if the robot exclusion protocol requires
              it, instead of scanning other URLs in the URL queue.  This is
              useful when loading a single server, since all the URLs to be
              loaded are subject to the same delay.

       -depth <depth>
              Set the exploration threshold, that is the maximum number of
              documents to load.

       -filter <regexp>
              Only load those URLs that match the filter (/youpi/ &&
              !/youpinsc/ for instance).  The loaded URLs are examined to
              find HREFs if the depth of the search allows it (see the
              -depth option).

       -level <level>
              Stop recursion after exploring <level> levels of hypertext
              links.  The first URL explored is level 1; the URLs contained
              in this URL are level 2, and so on.  Since -depth limits the
              number of documents, it should also be specified or its
              default applies.

MIFLUZ OPTIONS
       These options are only available if webbase was compiled with the
       mifluz indexing library.  They override the default values and the
       values found in the ~/.mifluz or $MIFLUZ_CONFIG file, if any.

       -verbose_hooks
              Display information as indexing proceeds.

       -hook_cache_size <bytes>
              Indexing cache size hint, in bytes.  Should be roughly 2% of
              the expected index size.  The bigger the better (default
              10000000).

       -hook_page_size <bytes>
              Page size of the underlying index file (default 4096).

       -hook_compress {1,0}
              The index file is compressed if 1, not compressed if 0
              (default compressed).

DEBUG OPTIONS
       -verbose
              Main program debug messages.

       -verbose_webbase
              Database manipulation related messages.

       -verbose_webtools
              Network connection related messages.

       -verbose_hook
              Binding with the full text indexer related messages.

       -verbose_robots
              robots.txt loading and parsing related messages.

       -verbose_cookies
              Cookies handling related messages.

       -verbose_crawl
              Many messages from the crawl logic.

       -babil_crawl
              Many messages from the crawl logic.  Huge output.

       -verbose_dirsel
              Many messages from the URL exclusion logic.  This includes
              the handling of -allow and -disallow and of robots.txt.

EXAMPLES
       crawler -depth 1000 -- http://www.droit.umontreal.ca/

       Mirrors in depth the web starting at the URL
       http://www.droit.umontreal.ca/.

       crawler -unescape -depth 100 -- http://www.iway.fr/wwb.cgi

       Explores the web portion starting at http://www.iway.fr/wwb.cgi
       until it finds 100 documents or there are no documents left.

       crawler -- http://www.law.indiana.edu:80/law/lawindex.html

       Simple copy of the lawindex.html file to the local disk.  No
       exploration is done, since the default value of the -depth option
       is 0.

SEE ALSO
       uri(3)
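       The reindex-only rebuild mentioned under the -rebuild option can be
       driven from the shell as in the following sketch (the database name
       is a placeholder; the text above only requires that ECILA_NO_CRAWL
       be set):

              ECILA_NO_CRAWL=1 crawler -base webbase -rebuild

       Should the rebuild be interrupted, -rebuild_resume picks up where it
       left off.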