 catdoc(1)                     MS-Word reader                      catdoc(1)
                          Version @catdoc_version@



 NAME
      catdoc - reads an MS-Word file and writes its content as plain text
      to standard output

 SYNOPSIS
      catdoc [-vlu8btawxV] [-m number] [-s charset] [-d charset]
      [-f output-format] file


 DESCRIPTION
      catdoc behaves much like cat(1), but it reads an MS-Word file and
      produces human-readable text on standard output.  Optionally it can
      use latex(1) escape sequences for characters which have special
      meaning for LaTeX.  It also makes some effort to recognize MS-Word
      tables, although it never tries to write correct headers for the
      LaTeX tabular environment.  Additional output formats, such as HTML,
      can easily be defined.

      catdoc doesn't attempt to extract formatting information other than
      tables from an MS-Word document, so the output modes differ mainly in
      which characters are escaped and in how characters missing from the
      output charset are represented.  See CHARACTER SUBSTITUTION below.


      catdoc uses an internal unicode(4) representation of text, so it is
      able to convert text when the charset of the source document doesn't
      match the charset of the target system.  See CHARACTER SETS below.

      If no file names are supplied, catdoc processes its standard input
      unless it is a terminal.  It is unlikely that somebody would type a
      Word document from the keyboard, so if catdoc is invoked without
      arguments and stdin is not redirected, it prints a brief usage
      message and exits.  Processing of standard input (even among other
      files) can be forced by using a dash '-' as a file name.
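
      The input-selection rule described above can be sketched in Python.
      This illustrates the documented behavior only and is not taken from
      the catdoc sources; the function name select_inputs and its return
      convention (None meaning "print usage") are assumptions made for the
      example.

```python
def select_inputs(file_args, stdin_is_tty):
    """Return the list of inputs to process, or None to print usage.

    Hypothetical helper: "-" stands for standard input, as in the text.
    """
    if file_args:
        # Explicit names are used as given; "-" among them means stdin.
        return list(file_args)
    if stdin_is_tty:
        # Nobody types a Word document by hand: show usage and exit.
        return None
    # No names given and stdin is redirected: read the document from stdin.
    return ["-"]
```

      In a real program the second argument would typically come from
      sys.stdin.isatty().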

      By default, catdoc wraps lines which are more than 72 chars long and
      separates paragraphs by blank lines.  This behavior can be turned
      off with the -w switch.  In wide mode catdoc prints each paragraph as
      one long line, suitable for import into word processors that perform
      word wrapping themselves.
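
      The two output layouts can be approximated with Python's textwrap
      module. This is a rough sketch, not catdoc's own wrapping algorithm;
      the function name wrap_paragraphs is hypothetical, and treating a
      margin of 0 as wide mode follows the description of -m 0 in OPTIONS.

```python
import textwrap


def wrap_paragraphs(paragraphs, margin=72):
    """Join paragraphs into output text, wrapping at `margin` columns.

    margin=0 imitates wide mode: one unwrapped line per paragraph.
    """
    if margin == 0:
        return "\n".join(paragraphs) + "\n"
    # textwrap.fill keeps every output line within `margin` characters.
    wrapped = [textwrap.fill(p, width=margin) for p in paragraphs]
    # Wrapped paragraphs are separated by a blank line.
    return "\n\n".join(wrapped) + "\n"
```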



 OPTIONS
      -a      - shortcut for -f ascii. Produces ASCII text as output.
              Separates table columns with TAB.

      -b      - process broken MS-Word file. Normally, catdoc checks whether
              the first 8 bytes of the file are the Microsoft OLE signature.
              If so, it processes the file, otherwise it just copies it to
              stdout. This is intended to allow using catdoc as a filter for
              viewing all files with a .doc extension.

      -dcharset
              - specifies the destination charset name. The charset file
              has the format described in CHARACTER SETS below, should have
              a .txt extension and should reside in the catdoc library
              directory (/usr/local/lib/hpux64/catdoc). By default, the
              locale charset is used if langinfo support is compiled in.

      -fformat
              - specifies the output format as described in CHARACTER
              SUBSTITUTION below.  catdoc comes with two output formats,
              ascii and tex. You can add your own if you wish.

      -l      Causes catdoc to list the names of available charsets on
              stdout and exit successfully.

      -mnumber
              Specifies the right margin for text (default 72). -m 0 is
              equivalent to -w.

      -scharset
              Specifies the source charset (the one used in the Word
              document), if the Word document doesn't contain UTF-16 text.
              When reading RTF documents it is typically not necessary,
              because RTF documents contain an ansicpg specification. But
              it can be set wrong by Word (I've seen RTF documents in
              Russian where cp1252 was specified). In this case this option
              takes precedence over the charset specified in the document.
              However, the source_charset statement in the configuration
              file has lower priority than the charset in the document.

      -t      - shortcut for -f tex. Converts all printable chars which
              have special meaning for LaTeX into appropriate control
              sequences. Separates table columns by &.

      -u      - declares that the Word document contains a UNICODE (UTF-16)
              representation of text (as some Word-97 documents do). If
              catdoc fails to convert a Word document correctly with the
              default charset, try this option.

      -8      - declares that the Word document is 8-bit, in case catdoc
              recognizes the file format incorrectly.

      -w      disables word wrapping. By default catdoc output is split into
              lines not longer than 72 characters (or the number specified
              by the -m option) and paragraphs are separated by a blank
              line. With this option each paragraph is one long line.

      -x      causes catdoc to output unknown UNICODE characters as \xNNNN
              instead of question marks.

      -v      causes catdoc to print some useless information about Word
              document structure to stdout before the actual start of text.

      -V      outputs catdoc version
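
      The signature test mentioned under -b can be illustrated as follows.
      This is a minimal sketch, not catdoc's code: the 8-byte value is the
      header signature of the OLE compound file format (it is not quoted
      in this manual page), and the function name has_ole_signature is
      hypothetical.

```python
# Header signature of an OLE compound file (first 8 bytes of the file).
OLE_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")


def has_ole_signature(data):
    """Return True if `data` starts with the 8-byte OLE signature."""
    return data[:8] == OLE_SIGNATURE


def file_has_ole_signature(path):
    """Read the first 8 bytes of a file and test them."""
    with open(path, "rb") as f:
        return has_ole_signature(f.read(8))
```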


 CHARACTER SETS
      When processing an MS-Word file, catdoc uses information about two
      character sets, typically different: input and output. They are
      stored in plain text files in the catdoc library directory. Each line
      of a character set file should contain two whitespace-separated
      hexadecimal numbers: the 8-bit code in the character set and the
      16-bit Unicode code.  Anything from a hash mark to the end of the
      line is ignored, as are blank lines.
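
      A reader for this file format might look like the following Python
      sketch. It is an illustration under stated assumptions, not catdoc's
      parser: the name parse_charset is hypothetical, and silently skipping
      lines that carry fewer than two numbers is an assumption made so that
      entries marked as undefined do not abort the parse.

```python
def parse_charset(lines):
    """Map 8-bit codes to Unicode code points from charset file lines."""
    table = {}
    for line in lines:
        # Drop everything from the hash mark to the end of the line.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: incomplete entries are skipped
        # int(..., 16) accepts the digits with or without a 0x prefix.
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table
```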

      The catdoc distribution includes some of these character sets.
      Additional character set definitions, directly usable by catdoc, can
      be obtained from ftp.unicode.org. Charset files have a .txt suffix,
      which shouldn't be specified on the command line or in configuration
      files.

      Note that catdoc is distributed with Cyrillic charsets as default. If
      you are not Russian, you probably don't want that, and should
      reconfigure catdoc at compile time or in the runtime configuration
      file.

      When dealing with documents with charsets other than the default,
      remember that Microsoft never uses ISO charsets. While letters in,
      say, cp1252 are at the same positions as in ISO-8859-1, some
      punctuation signs would be lost if you specified ISO-8859-1 as the
      input charset. If you use cp1252, catdoc deals with those signs as
      described in CHARACTER SUBSTITUTION below.


 CHARACTER SUBSTITUTION
      catdoc converts an MS-Word file into the following internal Unicode
      representation:

      1. Paragraphs are separated by the ASCII Line Feed symbol (0x000A).

      2. Table cells within a row are separated by the ASCII Field Separator
          symbol (0x001C).

      3. Table rows are separated by the ASCII Record Separator (0x001E).

      4. All printable characters, including whitespace, are represented
          by their respective UNICODE codes.
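
      As an illustration of how the separators above could be consumed,
      the Python sketch below renders a table fragment of the internal
      representation with a chosen column separator (TAB for the ascii
      format, & for tex, as listed in OPTIONS). The function name
      render_table is hypothetical, and ending each row with a newline is
      an assumption, since this page does not say how row ends are written.

```python
CELL_SEP = "\u001c"  # separates table cells within a row
ROW_SEP = "\u001e"   # separates table rows


def render_table(fragment, column_sep="\t"):
    """Render rows of cells from internal text, one row per line."""
    # A trailing row separator yields an empty last piece; drop it.
    rows = [row for row in fragment.split(ROW_SEP) if row]
    return "\n".join(column_sep.join(row.split(CELL_SEP)) for row in rows)
```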





      This UNICODE representation is subsequently converted into 8-bit text
      in the target character set using the following four-step algorithm:

      1. The list of special characters is searched for the given Unicode
          character. If it is found, the appropriate multi-character
          sequence is output instead of the character.

      2. If there is an equivalent in the target character set, it is
          output.

      3. Otherwise, the replacement list is searched and, if there is a
          multi-character substitution for this UNICODE char, it is output.

      4. If all of the above fails, the "Unknown char" symbol (question
          mark) is output.
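
      The four steps above can be sketched in Python as follows. This is
      an illustration of the lookup order only, not catdoc's code: the
      names convert_char and convert are hypothetical, the three tables are
      plain dictionaries standing in for the loaded map and charset files,
      and substitution strings are assumed to be ASCII.

```python
def convert_char(ch, special, target, replacement, unknown="?"):
    """Convert one Unicode character to bytes in the target charset.

    special, replacement: dicts from code point to output string.
    target: dict from code point to 8-bit code in the target charset.
    """
    cp = ord(ch)
    if cp in special:                # step 1: special character list
        return special[cp].encode("ascii")
    if cp in target:                 # step 2: direct equivalent
        return bytes([target[cp]])
    if cp in replacement:            # step 3: replacement list
        return replacement[cp].encode("ascii")
    return unknown.encode("ascii")   # step 4: unknown character


def convert(text, special, target, replacement, unknown="?"):
    """Apply convert_char to every character and join the results."""
    return b"".join(
        convert_char(ch, special, target, replacement, unknown)
        for ch in text
    )
```

      Note that the special list is consulted first, so a character such
      as & is escaped even though it exists in the target charset.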

      The list of special characters and the list of substitutions are
      character set-independent, because special chars should be escaped
      regardless of their existence in the target character set (usually
      they are part of US-ASCII, and therefore exist in any character set)
      and the replacement list is searched only for those characters which
      are not found in the target character set.

      These lists are stored in the catdoc library directory in files with
      the format name as prefix. These files have the following format:

      Each line can be either a comment (starting with a hash mark) or
      contain a hexadecimal UNICODE value, separated by whitespace from
      the string which is substituted for it. If the string contains no
      whitespace it can be used as is, otherwise it should be enclosed in
      single or double quotes. Usual backslash sequences like '\n' and
      '\t' can be used in these strings.
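
      A line of such a map file could be parsed roughly as shown below.
      This Python sketch is not catdoc's parser: the names unescape and
      parse_map_line are hypothetical, the set of recognized backslash
      sequences is an assumed subset, and leaving unrecognized sequences
      untouched is likewise an assumption.

```python
# Assumed subset of backslash sequences; the manual names only \n and \t.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", "'": "'", '"': '"'}


def unescape(s):
    """Expand known backslash sequences, leaving unknown ones as is."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            nxt = s[i + 1]
            out.append(ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def parse_map_line(line):
    """Return (code_point, string), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(None, 1)
    text = parts[1] if len(parts) > 1 else ""
    # Strip one pair of matching single or double quotes, if present.
    if len(text) >= 2 and text[0] in "'\"" and text[-1] == text[0]:
        text = text[1:-1]
    return int(parts[0], 16), unescape(text)
```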



 RUNTIME CONFIGURATION
      Upon startup catdoc reads its system-wide configuration file
      (catdocrc in the catdoc library directory) and then the user-specific
      configuration file ${HOME}/.catdocrc.

      These files can contain the following directives:

      source_charset = charset-name
              Sets the default source charset, which is used if no -s
              option is specified. Consult the configuration of a nearby
              Windows workstation to find the one you need.

      target_charset = charset-name
              Sets the default output charset. You probably know which one
              you use.

      charset_path = directory-list
              colon-separated list of directories which are searched for
              charset files.  This allows you to install additional charsets
              in your home directory.  If the first directory component of
              a path is ~, it is replaced by the contents of the HOME
              environment variable. On the MS-DOS platform, if a directory
              name starts with %s, it is replaced with the directory of the
              executable file. An empty element in the list (i.e. two
              consecutive colons) is considered the current directory.

      map_path = directory-list
              colon-separated list of directories which are searched for
              the special character map and the replacement map.  The same
              substitution rules as in charset_path are applied.

      format = format name
              Output format which is used by default.  catdoc comes with
              two formats, ascii and tex, but nothing prevents you from
              writing your own format (a set of two map files: a special
              character map and a replacement map).

      unknown_char = character specification
              sets the character to output instead of an unknown Unicode
              character (default '?'). The character specification can have
              one of two forms: a character enclosed in single quotes or a
              hexadecimal code.

      use_locale = (yes|no)
              Enables or disables automatic selection of the output charset
              (default yes), based on system locale settings (if enabled at
              compile time). If automatic detection is enabled, then output
              charset settings in the configuration files (but not on the
              command line) are ignored, and the current system locale
              charset is used instead. There is no automatic choice of
              input charset based on the locale language, because most
              modern Word files (since Word 97) are Unicode anyway.
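
      The directive syntax and the path rules above can be illustrated
      with a short Python sketch. It is not catdoc's configuration reader:
      the names parse_catdocrc and expand_path_list are hypothetical,
      skipping lines that start with a hash mark is an assumption (this
      page does not say whether the file allows comments), and the MS-DOS
      %s rule is left out.

```python
def parse_catdocrc(lines):
    """Parse 'name = value' directives into a dict (illustrative only)."""
    settings = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' lines carry no directive.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def expand_path_list(value, home):
    """Split a colon-separated directory list and apply the rules above."""
    dirs = []
    for d in value.split(":"):
        if d == "":
            d = "."            # empty element means the current directory
        elif d.startswith("~"):
            d = home + d[1:]   # leading ~ is replaced by the HOME value
        dirs.append(d)
    return dirs
```

      For example, a charset_path value of ~/charsets::/opt/charsets (both
      directory names are made up for the illustration) with HOME set to
      /home/u would expand to /home/u/charsets, the current directory and
      /opt/charsets, in that order.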


 BUGS
      Doesn't handle fast-saves properly. Prints footnotes as separate
      paragraphs at the end of file, instead of producing correct LaTeX
      commands. Cannot distinguish between empty table cell and end of table
      row.




 SEE ALSO
      xls2csv(1), catppt(1), cat(1), strings(1), utf(4), unicode(4)





 AUTHOR
      V.B.Wagner <vitus@45.free.net>

                                                  Formatted:  April 20, 2024