
 crawler(1)                                                       crawler(1)

      crawler - recursively explore the Internet and copy documents from URLs

      crawler [options] -- [url] ...

      crawler is used to maintain an up-to-date copy of a given set of URLs
      on the local machine.

      First it loads the specified URLs (args ...) locally.  Then it
      analyses their contents to find other URLs starting at the same base.
      It loads the found URLs until there are none left or until it reaches
      the depth limit (see the -depth option).

      Note that the -- must be specified to mark the end of the options.
      Otherwise the arguments will be mistaken for options.
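
      A minimal invocation might look like the following sketch; the
      database name, credentials and URL are placeholders, not values
      taken from this manual:

```shell
# Hypothetical example: crawl one site, using a local MySQL database
# named "webbase" for the meta information, loading up to 100 documents.
crawler -base webbase -user crawl -password secret \
        -depth 100 -- http://www.example.com/
```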

      -base <base name>
           Name of the MySQL database to use for the meta data information.

      -user <name>
           Name of the user for the database connection.

      -password <password>
           Password of the -user for the database connection.

      -port <port>
           TCP/IP port used to connect to the database, if not the default.

      -socket <file>
           Full path name of the Unix socket file for a local database
           connection.

      -host <hostname>
           Hostname of the database server.

      -net_buffer <size>
           Size of the network buffer for communications with the database
           (default 1 MB).
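
      Taken together, the connection options above might be combined as
      in this sketch; the host, port, credentials and URL are
      illustrative only:

```shell
# Hypothetical example: connect to a remote MySQL server on a
# non-default port, with a 4 MB network buffer.
crawler -base webbase -user crawl -password secret \
        -host db.example.com -port 3307 -net_buffer 4000000 \
        -- http://www.example.com/
```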

           Create all the tables. Exclusive, no other option accepted.

      -update
           Force updating even if the status of the Home Page says that
           it should resume after an interruption.

      -where_start <where clause>
           Only consider those Home Pages that match the <where clause>

                                    - 1 -           Formatted:  July 1, 2022

           Check and fix the concordance between the meta information
           database and the fulltext index.

           Remove all the records from the full text database and resubmit
           all the URLs for indexing. If the environment variable
           ECILA_NO_CRAWL is set, it will not crawl, just reindex the URLs
           for which a file is present in the crawler cache.

           Resume from an interrupted -rebuild.

           Do not notify the fulltext engine when a document is inserted or
           deleted.

      -timeout <sec>
           Set the TCP/IP timeout delay to <sec> seconds.

      -loaded_delay <days>
           Do not try to load URLs loaded successfully less than <days>
           days ago.  It defaults to 7 days.

      -modified_delay <days>
           Do not try to load URLs stated not modified less than <days>
           days ago.  It defaults to 30 days.

      -not_found_delay <days>
           Do not try to load URLs stated not found less than <days> days
           ago.  It defaults to 60 days.

      -timeout_delay <days>
           Do not try to load URLs for which a timeout (a real timeout, a
           failed connection, or in short any error that is likely to
           disappear within a short delay) occurred less than <days> days
           ago.  It defaults to 3 days.

      -robot_delay <sec>
           Number of seconds between requests to the same server when robot
           exclusion compliant (default 60).
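
      As a sketch, the delay options might be tuned for a frequently
      changing site as follows; the values and URL are illustrative, not
      recommendations:

```shell
# Hypothetical example: retry successfully loaded URLs after 1 day,
# unmodified URLs after 7 days, and wait 30 seconds between requests
# to the same server.
crawler -loaded_delay 1 -modified_delay 7 -robot_delay 30 \
        -- http://www.example.com/
```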

      -accept <mime types>
           Only accept the specified MIME types.  A comma separated list of
           MIME type specifications is allowed.  The * may be used to
           specify any type or subtype.
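
      For instance, restricting a crawl to text documents might look
      like this sketch; the MIME type pattern and URL are illustrative:

```shell
# Hypothetical example: accept any text subtype
# (text/html, text/plain, ...).
crawler -accept 'text/*' -- http://www.example.com/
```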


      -size_limit <limit in bytes>
           Maximum size of a loadable URL, in bytes.

      -heuristics
           Is a boolean option that activates or deactivates heuristics
           preventing unnecessary network access. An example of such a
           heuristic is to rely on the local image of a URL if this URL was
           loaded successfully from the WWW less than seven days ago. Such
           a behaviour is very useful most of the time, but can be
           inhibited via the -noheuristics flag if it becomes undesirable.
           The -loaded_delay, -not_modified_delay and -not_found_delay
           options are parameters for the functions that implement the
           heuristics and are ignored if the -noheuristics flag is set.

      -robot_exclusion
           Is a boolean that controls the application of the proposed robot
           exclusion protocol. If -norobot_exclusion is given, the proposed
           robot exclusion protocol is not used. The -norobot_exclusion
           option should be used only when synchronizing a few URLs by hand
           or when calling the crawler from an interactive application.

      -allow <list>
           In the same fashion as robots.txt, allow a list of prefixes for
           exploration. It comes in addition to the robots.txt information.
           <list> is a white space separated list of prefixes such as
           "/dir /dir/~".

      -disallow <list>
           In the same fashion as robots.txt, disallow a list of prefixes;
           the robot will never visit them. It comes in addition to the
           robots.txt information. <list> is a white space separated list
           of prefixes such as "/dir /dir/~".
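
      A combined use of -allow and -disallow might look like the
      following sketch; the prefixes and URL are placeholders:

```shell
# Hypothetical example: explore only /docs, but never the
# /docs/private subtree, in addition to whatever robots.txt says.
crawler -allow "/docs" -disallow "/docs/private" \
        -- http://www.example.com/
```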

      -agent <name>
           Set the User-Agent parameter for HTTP communication to <name>
           instead of the default value.

           Sleep immediately if the robot exclusion protocol requires it,
           instead of scanning other URLs in the URL queue. This is useful
           when loading a single server, since all the URLs to be loaded
           are subject to the same delay.

      -depth <depth>
           Set the exploration threshold, i.e. the maximum number of
           documents to load (default 0).

      -filter <regexp>
           Only load those URLs that match <regexp> (/youpi/ &&
           !/youpinsc/ for instance). The URLs loaded are examined to find
           HREFs, if the depth of the search allows it (see the -depth
           option).
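
      Building on the pattern quoted above, a -filter invocation might
      be sketched as follows; the pattern and URL are illustrative:

```shell
# Hypothetical example: load only URLs matching /youpi/ but not
# /youpinsc/, up to 100 documents.
crawler -depth 100 -filter '/youpi/ && !/youpinsc/' \
        -- http://www.example.com/
```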


      -level <level>
           Stop recursion after exploring <level> levels of hypertext
           links. The first URL explored is level one; the URLs contained
           in this URL are level two, and so on. Since -depth limits the
           number of documents, it should also be specified or the default
           applies.
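
      The interplay of -level and -depth might be sketched as follows;
      the values and URL are illustrative:

```shell
# Hypothetical example: follow at most 3 levels of hypertext links,
# but stop after 500 documents in any case.
crawler -level 3 -depth 500 -- http://www.example.com/
```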

      These options are only available if webbase was compiled with the
      mifluz indexing library. They override the default values and the
      values found in the ~/.mifluz or $MIFLUZ_CONFIG file, if any.

           Display information as indexing proceeds.

      -hook_cache_size <bytes>
           Indexing cache size hint, in bytes. Should be roughly 2% of the
           expected index size. The bigger the better (default 10000000).

      -hook_page_size <bytes>
           The page size of the underlying index file (default 4096).

      -hook_compress {1,0}
           The index file is compressed if 1, not compressed if 0 (default
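
      If webbase was compiled with mifluz, the indexing options above
      might be overridden as in this sketch; the sizes and URL are
      illustrative:

```shell
# Hypothetical example: a 20 MB indexing cache with 8 KB pages,
# overriding values from ~/.mifluz or $MIFLUZ_CONFIG.
crawler -hook_cache_size 20000000 -hook_page_size 8192 \
        -- http://www.example.com/
```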

           Main program debug messages.

           Database manipulation related messages.

           Network connection related messages.

           Binding with full text indexer related messages.

           Robots.txt loading and parsing related messages.

           Cookies handling related messages.

           Many messages from the crawl logic.

           Many messages from the crawl logic. Huge output.


           Many messages from the URL exclusion. This includes handling of
           -allow and -disallow and of robots.txt.

 EXAMPLES
      crawler -depth 1000 -- <url> ...
           Mirrors in depth the Web starting at the given URLs.

      crawler -unescape -depth 100 -- <url>
           Explores the web portion starting at <url> until it finds 100
           documents or there are no documents left.

      crawler -- <url>/lawindex.html
           Simple copy of the lawindex.html file on the local disk. No
           exploration is done, since the default value for the -depth
           option is 0.
