 crawler(1)                          local                          crawler(1)



 NAME
      crawler - recursively explore the Web and copy the documents at the
      given URLs to the local machine

 SYNOPSIS
      crawler [options] -- [url] ...

 DESCRIPTION
      crawler maintains an up-to-date local copy of a given set of URLs.

      First it loads the specified URLs (url ...) locally.  Then it analyses
      their contents to find other URLs sharing the same base.  It keeps
      loading the URLs it finds until none remain or until it reaches the
      depth limit (see the -depth option).

      Note that the -- must be specified to mark the end of the options.
      Otherwise the arguments will be mistaken for options.
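
      For instance, the following invocation (the URL is purely an
      illustration) limits the crawl to at most 10 documents and uses -- to
      mark the end of the options:

           crawler -depth 10 -- http://www.example.org/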

 DATABASE OPTIONS
      -base <base name>
           Name of the MySQL database that holds the meta information.

      -user <name>
           Name of the user used to connect to the database.

      -password <password>
           Password of the -user used to connect to the database.

      -port <port>
           TCP/IP port used to connect to the database, if not the default.

      -socket <file>
           Full path of the Unix socket file used for a local database
           connection.

      -host <hostname>
           Name of the host to connect to for the database.

      -net_buffer <size>
           Size of the network buffer used for communications with the
           database (default is 1 MB).

      -create
           Create all the tables.  This option is exclusive: no other option
           is accepted.
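
           For example, the tables could be created once with:

                crawler -create

           and a subsequent crawl could point at a specific database (the
           names below are purely illustrative):

                crawler -base webbase -user wb -password secret \
                     -host db.example.org -- http://www.example.org/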

 CRAWL OPTIONS
      -update
           Force updating even if the status of the Home Page says that it
           should resume after an interruption.

      -where_start <where clause>
           Only consider those Home Pages that match the <where clause>
           restriction.

      -rehook
           Check and fix the consistency between the meta information
           database and the fulltext index.

      -rebuild
           Remove all the records from the full text database and resubmit
           all the URLs for indexing.  If the environment variable
           ECILA_NO_CRAWL is set, it does not crawl but only reindexes the
           URLs for which a file is present in the crawler cache.
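
           For example (the database name is purely illustrative), the
           cached documents could be reindexed without any crawling with:

                ECILA_NO_CRAWL=1 crawler -rebuild -base webbase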

      -rebuild_resume
           Resume from an interrupted -rebuild.

      -no_hook
           Do not notify the fulltext engine when a document is inserted or
           removed.

      -timeout <sec>
           Set the TCP/IP timeout delay to <sec> seconds.

      -loaded_delay <days>
           Do not try to load URLs that were loaded successfully less than
           <days> days ago.  Defaults to 7 days.

      -modified_delay <days>
           Do not try to load URLs reported as not modified less than <days>
           days ago.  Defaults to 30 days.

      -not_found_delay <days>
           Do not try to load URLs reported as not found less than <days>
           days ago.  Defaults to 60 days.

      -timeout_delay <days>
           Do not try to load URLs for which a timeout occurred (a real
           timeout or a connection failure, in short any error that is
           likely to disappear within a short delay) less than <days> days
           ago.  Defaults to 3 days.
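
           As an illustration (the values and URL are hypothetical), the
           delay options can be combined on a single command line:

                crawler -loaded_delay 14 -modified_delay 60 \
                     -not_found_delay 90 -timeout_delay 1 \
                     -- http://www.example.org/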

      -robot_delay <sec>
           Number of seconds between requests to the same server when
           complying with the robot exclusion protocol (default 60).

      -accept <mime types>
           Only accept the specified MIME types.  A comma separated list of
           MIME type specifications is allowed.  The * may be used to match
           any type or subtype.
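
           For example (the URL is illustrative), the crawl could be
           restricted to HTML, plain text and any image subtype; the list is
           quoted to keep the shell from expanding the *:

                crawler -accept 'text/html,text/plain,image/*' \
                     -- http://www.example.org/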





      -size_limit <limit in bytes>
           biggest URL loadable.

      -noheuristics
           Disable the heuristics that prevent unnecessary network access.
           An example of such a heuristic is to rely on the local image of a
           URL if this URL was loaded successfully from the WWW less than
           seven days ago.  Such behaviour is very useful most of the time,
           but it can be inhibited with the -noheuristics flag if it becomes
           undesirable.  The -loaded_delay, -modified_delay and
           -not_found_delay options are parameters of the functions that
           implement the heuristics and are ignored if the -noheuristics
           flag is set.

      -norobot_exclusion
           Disable the proposed robot exclusion protocol.  If
           -norobot_exclusion is given, the proposed robot exclusion
           protocol is not used.  This option should be used only when
           synchronizing a few URLs by hand or when calling crawler from an
           interactive application.

      -allow <list>
           In the same fashion as robots.txt, allow a list of prefixes for
           exploration.  It comes in addition to the robots.txt information.
           <list> is a whitespace separated list of prefixes such as
           "/dir /dir/~".
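
           For example (the prefixes and URL are illustrative), the
           exploration could be confined to the /docs tree, in addition to
           whatever robots.txt already permits:

                crawler -allow "/docs /docs/~" -- http://www.example.org/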

      -disallow <list>
           In the same fashion as robots.txt, disallow a list of prefixes;
           the robot will never visit them.  It comes in addition to the
           robots.txt information.  <list> is a whitespace separated list of
           prefixes such as "/dir /dir/~".

      -agent <name>
           Set the User-Agent parameter for HTTP communication to <name>
           instead of the default value.

      -sleepy
           Sleep immediately if the robot exclusion protocol requires it,
           instead of scanning other URLs in the URL queue.  This is useful
           when loading a single server, since all the URLs to be loaded are
           subject to the same delay.

      -depth <depth>
           Set the exploration threshold, i.e. the maximum number of
           documents to load.

      -filter <regexp>
           Only load the URLs that match <regexp> (/youpi/ && !/youpinsc/
           for instance).  The loaded URLs are examined to find HREFs, if
           the depth of the search allows it (see the -depth option).
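
           For example (the pattern and URL are illustrative), only the URLs
           whose address contains "public" but not "private" would be loaded
           with:

                crawler -filter '/public/ && !/private/' -depth 500 \
                     -- http://www.example.org/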





      -level <level>
           Stop recursion after exploring <level> levels of hypertext links.
           The first URL explored is level 1, the URLs contained in that URL
           are level 2, and so on.  Since -depth limits the number of
           documents, it should also be specified or its default applies.
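
           For example (the URL is illustrative), the following explores at
           most two levels of links and at most 500 documents:

                crawler -level 2 -depth 500 -- http://www.example.org/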

 MIFLUZ OPTIONS
      These options are only available if webbase was compiled with the
      mifluz indexing library.  They override the default values and the
      values found in the ~/.mifluz or $MIFLUZ_CONFIG file, if any.

      -verbose_hooks
           Display information as indexing proceeds.

      -hook_cache_size <bytes>
           Indexing cache size hint, in bytes.  It must be roughly 2% of the
           expected index size; the bigger the better (default 10000000).

      -hook_page_size <bytes>
           The page size of the underlying index file (default 4096).

      -hook_compress {1,0}
           The index file is compressed if 1, not compressed if 0 (default
           compressed).
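
           As an illustration (the values and URL are hypothetical), a large
           crawl could raise the indexing cache and page size and disable
           compression with:

                crawler -hook_cache_size 50000000 -hook_page_size 8192 \
                     -hook_compress 0 -- http://www.example.org/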

 DEBUG OPTIONS
      -verbose
           Main program debug messages.

      -verbose_webbase
           Database manipulation related messages.

      -verbose_webtools
           Network connection related messages.

      -verbose_hook
           Binding with full text indexer related messages.

      -verbose_robots
           Robots.txt loading and parsing related messages.

      -verbose_cookies
           Cookies handling related messages.

      -verbose_crawl
           Many messages from the crawl logic.

      -babil_crawl
           Even more messages from the crawl logic than -verbose_crawl.
           Huge output.





      -verbose_dirsel
           Many messages from the URL exclusion logic.  This includes the
           handling of -allow and -disallow and of robots.txt.

 EXAMPLES
      crawler.pl -depth 1000 -- http://www.droit.umontreal.ca/

      Mirrors in depth the Web starting at the URL
      http://www.droit.umontreal.ca/.

      crawler.pl -unescape -depth 100 -- http://www.iway.fr/wwb.cgi

      Explores the portion of the Web starting at
      http://www.iway.fr/htbin/iway/webfr_display.cgi until it finds 100
      documents or there are no documents left.

      crawler http://www.law.indiana.edu:80/law/lawindex.html

      Simply copies the lawindex.html file to the local disk.  No
      exploration is done since the default value of the -depth option is
      0.

 SEE ALSO
      uri(3)