HXPIPE(1) 8.x HXPIPE(1)
HTML-XML-utils HTML-XML-utils
10 Feb 2022
NAME
hxpipe - convert XML file to a format easier to parse with Perl or AWK
SYNOPSIS
hxpipe [ -l ] [ -H ] [ -- ] [ file-or-URL ]
DESCRIPTION
hxpipe parses an HTML or XML file and outputs a line-oriented
representation of it that is well suited to further processing with
AWK or similar tools. The format is similar to the ESIS (Element
Structure Information Set) that is output by nsgmls/onsgmls. The
reverse operation, converting back to mark-up, is performed by the
hxunpipe program. The output format is as follows:
<!--comment-->
Comments are output as
    *comment
I.e., a single line starting with "*" followed by the text
of the comment. Line feeds, carriage returns and tabs in the
text are written as "\n", "\r" and "\t", respectively. Text
that looks like a numerical character entity is written with
the "&" replaced by "\". The line ends with a line feed.
Note that onsgmls outputs comments starting with a "_"
instead of a "*" and doesn't replace the "&" of numerical
character entities by "\" (and by default it omits comments
altogether).
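The escaping described above can be reversed mechanically. The following is a minimal Python sketch, not part of hxpipe; the name `unescape` is hypothetical. It assumes that only the escapes named above occur ("\n", "\r", "\t", and "\#" standing in for "&#") and leaves any other backslash untouched.

```python
import re

# Escapes named in the text: \n, \r, \t, and "\#" in place of "&#".
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t"}

def unescape(text):
    """Undo the escaping applied to comment text (sketch, see assumptions)."""
    def replace(match):
        char = match.group(1)
        if char == "#":
            return "&#"  # restore the start of a numerical character entity
        return _SIMPLE[char]
    return re.sub(r"\\([nrt#])", replace, text)

# Example input as it might appear after the leading "*" of a comment line.
print(unescape(r"first line\nsecond\tline \#233;"))
```

The same function applies to processing instructions, attribute values and text, since the man page states that they are escaped as for comments.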
<?processing instruction>
Processing instructions are output as
    ?processing instruction
I.e., a single line starting with a "?" followed by the text
of the processing instruction. The text is escaped as for
comments (see above).
<!DOCTYPE root PUBLIC "-//foo//DTD bar//EN" "http://example.org/dtd">
DOCTYPEs are output as one of the following:
    !root "-//foo//DTD bar//EN" http://example.org/dtd
    !root "-//foo//DTD bar//EN"
    !root "" http://example.org/dtd
    !root ""
for respectively: a DOCTYPE with (1) both a public and a
system identifier, (2) only a public identifier, (3) only a
system identifier, or (4) neither of the two. I.e., a single
line starting with a "!", followed by a space and a possibly
empty quoted string, followed optionally by a space and
arbitrary text. Note the quotes for the public identifier
and the absence of quotes for the system identifier.
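As an illustration of the four forms, the sketch below splits a "!" line into its parts. It is a Python sketch under the assumption that the root name contains no spaces and the public identifier contains no double quote; `parse_doctype` is a hypothetical helper, not something hxpipe provides.

```python
import re

# "!" + root name, a space, a quoted (possibly empty) public identifier,
# then optionally a space and the unquoted system identifier.
_DOCTYPE = re.compile(r'^!(\S+) "([^"]*)"(?: (.*))?$')

def parse_doctype(line):
    """Return (root, public_id, system_id); system_id is None if absent."""
    match = _DOCTYPE.match(line)
    if match is None:
        raise ValueError("not a DOCTYPE line: %r" % line)
    return match.group(1), match.group(2), match.group(3)

print(parse_doctype('!root "-//foo//DTD bar//EN" http://example.org/dtd'))
```

An empty public identifier comes back as an empty string, while a missing system identifier comes back as None, which keeps the four cases distinguishable.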
<elt att1="value1" att2="value2">
A start tag is output as
    Aatt1 CDATA value1
    Aatt2 CDATA value2
    (elt
I.e., as zero or more lines for the attributes and one line
for the element type. Each line for an attribute starts with
"A" followed by the name of the attribute, a space, the
literal string "CDATA", another space, and the attribute
value. The text of the attribute value is escaped as for
comments (see above). The line for the element type starts
with "(" followed by the element type.
hxpipe does not read DTDs and assumes that attributes are
always CDATA. It never generates other types (IMPLIED,
TOKEN, ID, etc.), unlike onsgmls.
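Because the attribute lines come before the "(" line they belong to, a consumer has to buffer them. The following Python sketch shows one way to do that; `read_start_tags` is a hypothetical name, and attribute values are returned still escaped.

```python
def read_start_tags(lines):
    """Yield (element_type, attributes) for every "(" line (sketch)."""
    pending = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("A"):
            # "A" + name, space, type (always CDATA here), space, value.
            # Pad so that a missing value becomes an empty string.
            name, _type, value = (line[1:].split(" ", 2) + ["", ""])[:3]
            pending[name] = value
        elif line.startswith("("):
            yield line[1:], pending
            pending = {}
        elif line.startswith("|"):
            # These attributes belonged to an empty element; drop them.
            pending = {}

sample = ["Aatt1 CDATA value1\n", "Aatt2 CDATA value2\n", "(elt\n"]
for element, attributes in read_start_tags(sample):
    print(element, attributes)
```

Splitting at most twice keeps any spaces inside the attribute value intact.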
</elt>
End tags are output as
    )elt
I.e., as a line starting with ")" followed by the element
type.
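Since every line identifies its kind by its first character, matching "(" and ")" lines needs only a stack. The Python sketch below checks that the two are properly nested; `check_nesting` is a hypothetical helper for illustration, not an hxpipe feature.

```python
def check_nesting(lines):
    """Return True if every ")" line closes the most recent open "(" line."""
    stack = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("("):
            stack.append(line[1:])
        elif line.startswith(")"):
            # A ")" with nothing open, or with the wrong type, is an error.
            if not stack or stack.pop() != line[1:]:
                return False
    return not stack  # anything left open counts as a failure

print(check_nesting(["(a\n", "(b\n", ")b\n", ")a\n"]))
```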
<empty att1="val1" att2="val2"/>
Empty elements are output as
    Aatt1 CDATA val1
    Aatt2 CDATA val2
    |empty
I.e., as zero or more lines for attributes and one line
starting with "|" followed by the element type.
Note that onsgmls never outputs "|". (However, it can
optionally output a line consisting of a single "e" just
before the "(" line, to indicate that the element is empty.)
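A consumer that only understands "(" and ")" lines could rewrite each "|" line as a start/end pair. The sketch below is one way to do so in Python; `expand_empty` is a hypothetical name, and whether such a rewrite is appropriate depends on the consumer.

```python
def expand_empty(lines):
    """Replace each "|name" line by "(name" followed by ")name" (sketch)."""
    for line in lines:
        if line.startswith("|"):
            name = line[1:].rstrip("\n")
            yield "(" + name + "\n"
            yield ")" + name + "\n"
        else:
            # Attribute lines pass through and still precede the "(" line.
            yield line

for line in expand_empty(["Aatt1 CDATA val1\n", "|empty\n"]):
    print(line, end="")
```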
text
Text is output as
    -text
I.e., as a single line starting with a "-". The text is
escaped as for comments (see above).
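A common use of the line-oriented format is to report each piece of text together with the elements that enclose it. The Python sketch below does this by tracking "(" and ")" lines; `text_with_path` and the "/"-separated path notation are illustrative choices, and the text is returned still escaped.

```python
def text_with_path(lines):
    """Yield (path, text) for each "-" line; text is left escaped."""
    open_elements = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("("):
            open_elements.append(line[1:])
        elif line.startswith(")"):
            if open_elements:
                open_elements.pop()
        elif line.startswith("-"):
            yield "/".join(open_elements), line[1:]

sample = ["(p\n", "-Hello \n", "(em\n", "-world\n", ")em\n", ")p\n"]
for path, text in text_with_path(sample):
    print(path, repr(text))
```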
line numbers
When the -l option is in effect, hxpipe will intersperse the
output with lines of the form
    L12
where "12" is replaced with the line number in the source
where the next output came from.

hxpipe normalizes the
input only in the sense that it outputs attributes in a
fixed order (alphabetical, but not locale-dependent). It
does not read a DTD and thus cannot remove redundant white
space and cannot add implied attributes. It does not expand
character entities. (But you can pipe the input through
hxunent beforehand.) It also does not add implied tags. (But
see the -H option.)
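The "L" lines produced by the -l option can be folded into the lines that follow them. The Python sketch below pairs each output line with the most recent line number; `with_line_numbers` is a hypothetical helper, and it assumes that a line not directly preceded by an "L" line keeps the last number seen.

```python
def with_line_numbers(lines):
    """Yield (source_line, output_line), dropping the "L" lines (sketch)."""
    current = None  # no "L" line seen yet
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("L") and line[1:].isdigit():
            current = int(line[1:])
        else:
            yield current, line

sample = ["L1\n", "(p\n", "L2\n", "-hi\n", ")p\n"]
for number, line in with_line_numbers(sample):
    print(number, line)
```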
OPTIONS
The following options are supported:
-l Add "L" lines to the output to indicate the line numbers in
the source. Currently does not work together with the -H
option.
-H Apply special rules for HTML. Normally, hxpipe assumes
well-formed XML. With this option, hxpipe will assume the
input is HTML and will add implied tags, recognize empty
elements and treat the contents of <script> and <style>
elements as literal text.
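When hxpipe is called from a script, the restriction that -l does not currently work with -H can be checked before the command is run. The Python sketch below only builds the argument list from the SYNOPSIS; `hxpipe_command` and its parameters are hypothetical, and raising an error for the combination is a choice of this sketch, not documented hxpipe behavior.

```python
def hxpipe_command(source=None, line_numbers=False, html=False):
    """Build an argument list following the SYNOPSIS (sketch)."""
    if line_numbers and html:
        # The man page says -l currently does not work together with -H.
        raise ValueError("-l currently does not work together with -H")
    command = ["hxpipe"]
    if line_numbers:
        command.append("-l")
    if html:
        command.append("-H")
    if source is not None:
        # "--" ends the options, so a name starting with "-" is safe.
        command += ["--", source]
    return command

print(hxpipe_command("page.html", html=True))
```

Omitting the source yields a command that reads standard input, as described under OPERANDS.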
OPERANDS
The following operand is supported:
file-or-URL
The name or URL of an HTML or XML file. If absent, standard input
is read instead.
EXIT STATUS
The following exit values are returned:
0 Successful completion.
> 0 An error occurred while parsing the input file. hxpipe
will try to correct the error and produce output anyway.
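Because output is produced even when the exit status is greater than 0, a caller may want to keep the output and record the status separately instead of treating a non-zero status as fatal. The Python sketch below shows this pattern; `run_and_collect` is a hypothetical helper, and the command passed in (for example an hxpipe argument list) is an assumption of the caller.

```python
import subprocess

def run_and_collect(command, data=""):
    """Run a command, returning (output_lines, parsed_cleanly) (sketch).

    A non-zero exit status is reported through the second value rather
    than raised, since output may still have been produced.
    """
    result = subprocess.run(
        command,
        input=data,
        stdout=subprocess.PIPE,
        text=True,
    )
    return result.stdout.splitlines(), result.returncode == 0
```

A call such as `run_and_collect(["hxpipe", "-H"], html_text)` would then return the output lines together with a flag saying whether the parse was error-free; `html_text` here stands for whatever input string the caller holds.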
ENVIRONMENT
To use a proxy to retrieve remote files, set the environment variables
http_proxy and ftp_proxy. E.g., http_proxy="http://localhost:8080/"
BUGS
The error recovery for incorrect HTML is primitive. hxpipe can
currently only retrieve remote files over HTTP. It doesn't handle
password-protected files, nor files whose content depends on HTTP
"cookies." Option -l ought to work also with HTML input (option -H).
SEE ALSO
hxunpipe(1), hxunent(1), onsgmls(1).
- 4 - Formatted: April 16, 2026