 HXPIPE(1)                           8.x                           HXPIPE(1)
 HTML-XML-utils                                               HTML-XML-utils

                                 10 Feb 2022



 NAME
      hxpipe - convert XML file to a format easier to parse with Perl or AWK

 SYNOPSIS
      hxpipe [ -l ] [ -H ] [ -- ] [ file-or-URL ]

 DESCRIPTION
      hxpipe parses an HTML or XML file and outputs a line-oriented
      representation of it that is well suited to further processing with
      AWK or similar tools. The format is similar to the ESIS (Element
      Structure Information Set) that is output by nsgmls/onsgmls.  The
      reverse operation, converting back to mark-up, is performed by the
      hxunpipe program.  The output format is as follows:

      <!--comment-->
                Comments are output as

                    *comment

                I.e., a single line starting with "*" followed by the text
                of the comment. Line feeds, carriage returns and tabs in the
                text are written as "\n", "\r" and "\t", respectively. Text
                that looks like a numerical character entity is written with
                the "&" replaced by "\".  The line ends with a line feed.

                Note that onsgmls outputs comments starting with a "_"
                instead of a "*" and doesn't replace the "&" of numerical
                character entities by "\" (and by default it omits comments
                altogether).
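
                A consumer of this format usually has to undo the escaping.
                The Python sketch below is one possible way; the name
                unescape is hypothetical, and treating "\\" as a literal
                backslash is an assumption, since the rules above do not
                say how a backslash in the source is written.

```python
import re

# One escape token: \n, \r, \t, \\ (assumed), or a numeric character
# entity whose "&" was replaced by "\" (e.g. \#169; or \#xA9;).
_ESCAPE = re.compile(r"\\(#[0-9]+;|#[xX][0-9A-Fa-f]+;|[nrt\\])")

_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\"}


def unescape(text):
    """Undo hxpipe escaping in a comment, PI, text or attribute value."""
    def replace(match):
        token = match.group(1)
        if token.startswith("#"):
            # Restore the "&" of a numeric character entity.
            return "&" + token
        return _SIMPLE[token]
    return _ESCAPE.sub(replace, text)
```

                The function only restores the original characters; it
                does not expand the character entities themselves.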

      <?processing instruction>
                Processing instructions are output as

                    ?processing instruction

                I.e., a single line starting with a "?" followed by the text
                of the processing instruction. The text is escaped as for
                comments (see above).

      <!DOCTYPE root PUBLIC "-//foo//DTD bar//EN" "http://example.org/dtd">
                DOCTYPEs are output as one of the following:

                    !root "-//foo//DTD bar//EN" http://example.org/dtd
                    !root "-//foo//DTD bar//EN"
                    !root "" http://example.org/dtd
                    !root ""

                for, respectively, a DOCTYPE with (1) both a public and a
                system identifier, (2) only a public identifier, (3) only a
                system identifier, or (4) neither of the two. I.e., a single
                line starting with a "!" and the name of the root element,
                followed by a space and a possibly empty quoted string,
                followed optionally by a space and arbitrary text. Note the
                quotes for the public identifier and the absence of quotes
                for the system identifier.
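
                The four forms can be told apart with one regular
                expression. The Python sketch below is illustrative only:
                parse_doctype is a hypothetical name, and the pattern
                assumes that the public identifier contains no double
                quote.

```python
import re

# "!" + root name, a space, a quoted (possibly empty) public identifier,
# then optionally a space and an unquoted system identifier.
_DOCTYPE = re.compile(r'^!(\S+) "([^"]*)"(?: (.*))?$')


def parse_doctype(line):
    """Return (root, public_id, system_id); absent identifiers are None."""
    match = _DOCTYPE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a DOCTYPE line: %r" % line)
    root, public_id, system_id = match.groups()
    return root, public_id or None, system_id or None
```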

      <elt att1="value1" att2="value2">
                A start tag is output as

                    Aatt1 CDATA value1
                    Aatt2 CDATA value2
                    (elt

                I.e., as zero or more lines for the attributes and one line
                for the element type. Each line for an attribute starts with
                "A" followed by the name of the attribute, a space, the
                literal string "CDATA", another space, and the attribute
                value. The text of the attribute value is escaped as for
                comments (see above). The line for the element type starts
                with "(" followed by the element type.

                hxpipe does not read DTDs and assumes that attributes are
                always CDATA. It never generates other types (IMPLIED,
                TOKEN, ID, etc.), unlike onsgmls.
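
                Because the attribute value may itself contain spaces, an
                "A" line should be split on its first two spaces only. A
                minimal Python sketch follows; parse_attribute is a
                hypothetical helper, and the returned value is still in
                escaped form.

```python
def parse_attribute(line):
    """Split an "A" line into (name, type, raw_value)."""
    if not line.startswith("A"):
        raise ValueError("not an attribute line: %r" % line)
    # Split on the first two spaces only; the value may contain spaces.
    name, _, rest = line[1:].rstrip("\n").partition(" ")
    attr_type, _, value = rest.partition(" ")
    return name, attr_type, value
```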

      </elt>    End tags are output as

                    )elt

                I.e., as a line starting with ")" followed by the element
                type.

      <empty att1="val1" att2="val2"/>
                Empty elements are output as

                    Aatt1 CDATA val1
                    Aatt2 CDATA val2
                    |empty

                I.e., as zero or more lines for attributes and one line
                starting with "|" followed by the element type.

                Note that onsgmls never outputs "|". (However, it can
                optionally output a line consisting of a single "e" just
                before the "(" line, to indicate that the element is empty.)
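
                The difference between "(" and "|" matters when tracking
                nesting: a start tag opens a level that the matching ")"
                closes, while an empty element leaves the depth unchanged.
                The Python sketch below illustrates this; outline is a
                hypothetical name and the input is assumed to be
                well-nested.

```python
def outline(lines):
    """Return one indented entry per element, in document order."""
    depth = 0
    entries = []
    for line in lines:
        kind, rest = line[:1], line[1:].rstrip("\n")
        if kind == "(":
            entries.append("  " * depth + rest)
            depth += 1          # children of this element nest deeper
        elif kind == ")":
            depth -= 1
        elif kind == "|":
            # An empty element has no end tag, so depth is unchanged.
            entries.append("  " * depth + rest)
    return entries
```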

      text      Text is output as

                    -text

                I.e., as a single line starting with a "-". The text is
                escaped as for comments (see above).

      line numbers
                When the -l option is in effect, hxpipe will intersperse the
                output with lines of the form

                    L12

                where "12" is replaced with the number of the source line
                from which the next output line was produced.

      hxpipe normalizes the input only in the sense that it outputs
      attributes in a fixed order (alphabetical, but not
      locale-dependent). It does not read a DTD and thus cannot remove
      redundant white space and cannot add implied attributes. It does not
      expand character entities. (But you can pipe the input through
      hxunent beforehand.) It also does not add implied tags. (But see the
      -H option.)
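
      Since every output line is identified by its first character, a
      reader can dispatch on that character alone. The Python sketch below
      shows one way to do so; KINDS and events are hypothetical names, and
      keeping the most recent "L" number until the next "L" line appears is
      an assumption about how the numbers should be attributed.

```python
KINDS = {
    "*": "comment",
    "?": "pi",
    "!": "doctype",
    "A": "attribute",
    "(": "start",
    ")": "end",
    "|": "empty",
    "-": "text",
}


def events(lines):
    """Yield (source_line, kind, payload) for each hxpipe output line.

    source_line is the number from the most recent "L" line, or None
    when no "L" line has been seen (e.g. the -l option was not used).
    """
    source_line = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        marker, payload = line[0], line[1:]
        if marker == "L":
            source_line = int(payload)
        elif marker in KINDS:
            yield source_line, KINDS[marker], payload
        else:
            raise ValueError("unrecognized line: %r" % line)
```

      The payload is passed through unchanged, so text and attribute values
      are still in escaped form.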

 OPTIONS
      The following options are supported:

      -l        Add "L" lines to the output to indicate the line numbers in
                the source. Currently does not work together with the -H
                option.

      -H        Apply special rules for HTML. Normally, hxpipe assumes
                well-formed XML. With this option, hxpipe will assume the
                input is HTML and will add implied tags, recognize empty
                elements and treat the contents of <script> and <style>
                elements as literal text.
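
      When hxpipe is called from a script, the options can be assembled
      programmatically. The Python sketch below is only an illustration:
      build_command and run_hxpipe are hypothetical helpers, rejecting the
      combination of -l and -H is a choice of this wrapper (the text above
      only says the two do not work together), and run_hxpipe requires
      hxpipe to be installed and on the PATH.

```python
import subprocess


def build_command(source=None, line_numbers=False, html=False):
    """Build the argument list for hxpipe from its synopsis."""
    if line_numbers and html:
        # -l is documented as not working together with -H.
        raise ValueError("-l cannot be combined with -H")
    command = ["hxpipe"]
    if line_numbers:
        command.append("-l")
    if html:
        command.append("-H")
    if source is not None:
        # "--" ends the options, so a name starting with "-" is safe.
        command += ["--", source]
    return command


def run_hxpipe(source, **options):
    """Run hxpipe and return (exit status, list of output lines)."""
    result = subprocess.run(build_command(source, **options),
                            capture_output=True, text=True)
    # A non-zero status signals a parse error; output may still exist.
    return result.returncode, result.stdout.splitlines()
```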




 OPERANDS
      The following operand is supported:

      file-or-URL
                The name or URL of an HTML or XML file. If absent, standard
                input is read instead.

 EXIT STATUS
      The following exit values are returned:

      0         Successful completion.

      > 0       An error occurred in the parsing of the HTML file.  hxpipe
                will try to correct the error and produce output anyway.

 ENVIRONMENT
      To use a proxy to retrieve remote files, set the environment variables
      http_proxy and ftp_proxy.  E.g., http_proxy="http://localhost:8080/"

 BUGS
      The error recovery for incorrect HTML is primitive.  hxpipe can
      currently only retrieve remote files over HTTP. It doesn't handle
      password-protected files, nor files whose content depends on HTTP
      "cookies." Option -l ought to work also with HTML input (option -H).

 SEE ALSO
      hxunpipe(1), hxunent(1), onsgmls(1).

                                    - 4 -           Formatted:  May 17, 2024