
 crawler(1)                                                       crawler(1)

      crawler - recursively explore the Internet and copy documents from URLs

      crawler [options] -- [url] ...

      crawler is used to maintain an up-to-date copy of a given set of URLs
      on the local machine.

      First it loads the specified URLs (args ...) locally.  Then it
      analyses their contents to find other URLs starting at the same base.
      It loads the found URLs until there are none left or until it reaches
      the depth limit (see the -depth option).

      Note that the -- must be specified to mark the end of the options.
      Otherwise the arguments will be mistaken for options.
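
      A minimal invocation might look like the following sketch; the
      database name, credentials and URL are placeholders, not values
      taken from this manual:

```shell
# Hypothetical example: crawl one site, using a local MySQL database
# named "webbase" for the meta information, loading up to 100 documents.
crawler -base webbase -user crawl -password secret \
        -depth 100 -- http://www.example.com/
```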

      -base <base name>
           Name of the MySQL database to use for the meta data information.

      -user <name>
           Name of the user for the database connection.

      -password <password>
           Password of the -user for the database connection.

      -port <port>
           TCP/IP port used to connect to the database, if not the default.

      -socket <file>
           Full path name of the Unix socket file for a local database
           connection.

      -host <hostname>
           Hostname of the database server.

      -net_buffer <size>
           Size of the network buffer for communications with the database
           (default 1 MB).
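
      Taken together, the connection options above might be combined as
      in this sketch; the host, port, credentials and URL are
      illustrative only:

```shell
# Hypothetical example: connect to a remote MySQL server on a
# non-default port, with a 4 MB network buffer.
crawler -base webbase -user crawl -password secret \
        -host db.example.com -port 3307 -net_buffer 4000000 \
        -- http://www.example.com/
```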

           Create all the tables. Exclusive, no other option accepted.

      -update
           Force updating even if the status of the Home Page says that
           it should resume after an interruption.

      -where_start <where clause>
           Only consider those Home Pages that match the <where clause>

                                    - 1 -           Formatted:  July 1, 2022

           Check and fix the concordance between the meta information
           database and the fulltext index.

           Remove all the records from the full text database and resubmit
           all the URLs for indexing. If the environment variable
           ECILA_NO_CRAWL is set, it will not crawl, just reindex the URLs
           for which a file is present in the crawler cache.

           Resume from an interrupted -rebuild.

           Do not notify the fulltext engine when a document is inserted or
           deleted.

      -timeout <sec>
           Set the TCP/IP timeout delay to <sec> seconds.

      -loaded_delay <days>
           Do not try to load URLs loaded successfully less than <days>
           days ago.  It defaults to 7 days.

      -modified_delay <days>
           Do not try to load URLs stated not modified less than <days>
           days ago.  It defaults to 30 days.

      -not_found_delay <days>
           Do not try to load URLs stated not found less than <days> days
           ago.  It defaults to 60 days.

      -timeout_delay <days>
           Do not try to load URLs for which a timeout (a real timeout, a
           failed connection, or in short any error that is likely to
           disappear within a short delay) occurred less than <days> days
           ago.  It defaults to 3 days.

      -robot_delay <sec>
           Number of seconds between requests to the same server when robot
           exclusion compliant (default 60).
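
      As a sketch, the delay options might be tuned for a frequently
      changing site as follows; the values and URL are illustrative, not
      recommendations:

```shell
# Hypothetical example: retry successfully loaded URLs after 1 day,
# unmodified URLs after 7 days, and wait 30 seconds between requests
# to the same server.
crawler -loaded_delay 1 -modified_delay 7 -robot_delay 30 \
        -- http://www.example.com/
```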

      -accept <mime types>
           Only accept the specified MIME types.  A comma separated list of
           MIME type specifications is allowed.  The * may be used to
           specify any type or subtype.
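
      For instance, restricting a crawl to text documents might look
      like this sketch; the MIME type pattern and URL are illustrative:

```shell
# Hypothetical example: accept any text subtype
# (text/html, text/plain, ...).
crawler -accept 'text/*' -- http://www.example.com/
```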


      -size_limit <limit in bytes>
           Maximum size of a loadable URL, in bytes.

      -heuristics
           Is a boolean option that activates or deactivates heuristics
           preventing unnecessary network access. An example of such a
           heuristic is to rely on the local image of a URL if this URL was
           loaded successfully from the WWW less than seven days ago. Such
           a behaviour is very useful most of the time, but can be
           inhibited via the -noheuristics flag if it becomes undesirable.
           The -loaded_delay, -not_modified_delay and -not_found_delay
           options are parameters for the functions that implement the
           heuristics and are ignored if the -noheuristics flag is set.

      -robot_exclusion
           Is a boolean that controls the application of the proposed robot
           exclusion protocol. If -norobot_exclusion is given, the proposed
           robot exclusion protocol is not used. The -norobot_exclusion
           option should be used only when synchronizing a few URLs by hand
           or when calling the crawler from an interactive application.

      -allow <list>
           In the same fashion as robots.txt, allow a list of prefixes for
           exploration. It comes in addition to the robots.txt information.
           <list> is a white space separated list of prefixes such as
           "/dir /dir/~".

      -disallow <list>
           In the same fashion as robots.txt, disallow a list of prefixes;
           the robot will never visit them. It comes in addition to the
           robots.txt information. <list> is a white space separated list
           of prefixes such as "/dir /dir/~".
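
      A combined use of -allow and -disallow might look like the
      following sketch; the prefixes and URL are placeholders:

```shell
# Hypothetical example: explore only /docs, but never the
# /docs/private subtree, in addition to whatever robots.txt says.
crawler -allow "/docs" -disallow "/docs/private" \
        -- http://www.example.com/
```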

      -agent <name>
           Set the User-Agent parameter for HTTP communication to <name>
           instead of the default value.

           Sleep immediately if the robot exclusion protocol requires it,
           instead of scanning other URLs in the URL queue. This is useful
           when loading a single server, since all the URLs to be loaded
           are subject to the same delay.

      -depth <depth>
           Set the exploration threshold, i.e. the maximum number of
           documents to load (default 0).

      -filter <regexp>
           Only load those URLs that match <regexp> (/youpi/ &&
           !/youpinsc/ for instance). The URLs loaded are examined to find
           HREFs, if the depth of the search allows it (see the -depth
           option).
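
      Building on the pattern quoted above, a -filter invocation might
      be sketched as follows; the pattern and URL are illustrative:

```shell
# Hypothetical example: load only URLs matching /youpi/ but not
# /youpinsc/, up to 100 documents.
crawler -depth 100 -filter '/youpi/ && !/youpinsc/' \
        -- http://www.example.com/
```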


      -level <level>
           Stop recursion after exploring <level> levels of hypertext
           links. The first URL explored is level one; the URLs contained
           in this URL are level two, and so on. Since -depth limits the
           number of documents, it should also be specified or the default
           applies.
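
      The interplay of -level and -depth might be sketched as follows;
      the values and URL are illustrative:

```shell
# Hypothetical example: follow at most 3 levels of hypertext links,
# but stop after 500 documents in any case.
crawler -level 3 -depth 500 -- http://www.example.com/
```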

      These options are only available if webbase was compiled with the
      mifluz indexing library. They override the default values and the
      values found in the ~/.mifluz or $MIFLUZ_CONFIG file, if any.

           Display information as indexing proceeds.

      -hook_cache_size <bytes>
           Indexing cache size hint, in bytes. Should be roughly 2% of the
           expected index size. The bigger the better (default 10000000).

      -hook_page_size <bytes>
           The page size of the underlying index file (default 4096).

      -hook_compress {1,0}
           The index file is compressed if 1, not compressed if 0 (default
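
      If webbase was compiled with mifluz, the indexing options above
      might be overridden as in this sketch; the sizes and URL are
      illustrative:

```shell
# Hypothetical example: a 20 MB indexing cache with 8 KB pages,
# overriding values from ~/.mifluz or $MIFLUZ_CONFIG.
crawler -hook_cache_size 20000000 -hook_page_size 8192 \
        -- http://www.example.com/
```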

           Main program debug messages.

           Database manipulation related messages.

           Network connection related messages.

           Binding with full text indexer related messages.

           Robots.txt loading and parsing related messages.

           Cookies handling related messages.

           Many messages from the crawl logic.

           Many messages from the crawl logic. Huge output.


           Many messages from the URL exclusion. This includes handling of
           -allow and -disallow and of robots.txt.

 EXAMPLES
      crawler -depth 1000 -- <url> ...
           Mirrors in depth the Web starting at the given URLs.

      crawler -unescape -depth 100 -- <url>
           Explores the web portion starting at <url> until it finds 100
           documents or there are no documents left.

      crawler -- <url>/lawindex.html
           Simple copy of the lawindex.html file on the local disk. No
           exploration is done, since the default value for the -depth
           option is 0.
