
			libunicode
			==========

Introduction
============

Libunicode offers low-level Unicode (UTF-16) text processing functionality,
which can be divided into three categories:

	o Character handling
	o String handling
	o Charsets handling

Libunicode uses the UTF-16 encoding defined by ISO/IEC 10646 for storing
and manipulating all character entities. It will support other encoding
standards (e.g., UTF-8, ISO 8859-x, etc.) for input and output only.

Where applicable, libunicode uses the "Single Unix Specification,
Version 2(R)" (susv2) as its API and semantics reference. Susv2 is the
unification and superset of the de jure POSIX and ANSI C (run-time library
part) standards and the de facto BSD standards. This means that if you
know the standard character and string handling functions, you can readily
use libunicode; and if you have an application that uses the standard
character/string processing facilities, you can make it Unicode-aware with
minimal trouble. Also, don't let the word "Unix" in the standard's name
confuse you. Susv2, like POSIX, is a standard for *Open* operating systems,
a category into which MS Windows, MacOS, etc. fit. The name was chosen by
The Open Group, the maintainer of susv2, to unite and defend market sectors
actively attacked by Microsoft with its "decommoditizing" tactics.
Libunicode is a clear example of the opposite approach, offering
cross-platform portability and compatibility for Unix and Win32
systems. (*)

(*) The opinions presented in the paragraph above are solely those of the
documentation author and should not be considered as reflecting the real
state of things.

Libunicode defines a new type, 'Uchar', which can hold any non-surrogate
UTF-16 character without space overhead.
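
To make "non-surrogate" concrete, here is a minimal sketch. It assumes that
'Uchar' is a 16-bit unsigned integer; the real definition lives in the
library headers and may differ. Both helper functions are hypothetical
names used for illustration only and are not part of libunicode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one 16-bit code unit; the library's actual typedef may differ. */
typedef uint16_t Uchar;

/* Hypothetical helper: nonzero if c is a UTF-16 surrogate code unit
   (U+D800..U+DFFF), i.e. one half of a pair rather than a character. */
static int sketch_is_surrogate(Uchar c)
{
    return c >= 0xD800 && c <= 0xDFFF;
}

/* Hypothetical helper: nonzero if every code unit before the terminating
   zero is a complete character on its own. */
static int sketch_all_single_unit(const Uchar *s)
{
    assert(s != NULL);
    for (; *s != 0; ++s) {
        if (sketch_is_surrogate(*s))
            return 0;
    }
    return 1;
}
```

A string for which sketch_all_single_unit() returns nonzero holds exactly
one character per array element, which is the case a 16-bit 'Uchar' covers
without extra space.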

The library offers two APIs: one is a precise remapping of the susv2
functions, and the other is a slightly higher-level API with automatic
memory management fully controlled by the user.

Functions of the first API (fully standard-compliant, and the one you will
probably use) use the 'u_' prefix, e.g. the standard

             char *strchr(const char *s, int c);

becomes

             Uchar *u_strchr(const Uchar *s, Uchar c);
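
As an illustration of what such a remapping means in practice, the sketch
below applies strchr semantics to zero-terminated 'Uchar' strings. It is
not the library's implementation: the 'Uchar' typedef is an assumption, and
the function is deliberately named sketch_u_strchr so that it is not
mistaken for the real u_strchr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one 16-bit code unit; the library's actual typedef may differ. */
typedef uint16_t Uchar;

/* Sketch of strchr semantics on Uchar strings: returns a pointer to the
   first occurrence of c in s, or NULL if c does not occur. As with strchr,
   the terminator counts as part of the string, so searching for 0 returns
   a pointer to the terminator. */
static Uchar *sketch_u_strchr(const Uchar *s, Uchar c)
{
    assert(s != NULL);
    for (;; ++s) {
        if (*s == c)
            return (Uchar *)s;  /* drop const, as strchr itself does */
        if (*s == 0)
            return NULL;
    }
}
```

Code that already calls strchr on char strings would keep the same control
flow; only the element type and the function prefix change.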

Functions of the second API use the 'uni_' prefix. They are intended for
use in special environments, for example, in Apache web server modules.
Most functions have completely identical 'u_' and 'uni_' implementations,
but the following differ from the standard in argument structure and
semantics:

uni_strcat
uni_strncat
uni_strdup
uni_strndup
uni_strcpy
uni_strncpy

Consult the library reference for their full description.



Below is a more detailed overview of the three libunicode subsystems:

Character handling
==================                                              

Libunicode implements the following functions from susv2, listed by
defining header:

<ctype.h>:

isalnum
isalpha
isascii
iscntrl
isdigit
isgraph
islower
isprint
ispunct
isspace
isupper
isxdigit
toascii
tolower
toupper
_tolower
_toupper


String handling
===============

Libunicode implements the following functions from susv2, listed by
defining header:

<string.h>, functions pertinent to string manipulation:

strcat
strchr
strcmp
strcpy
strcspn
strdup
strlen
strncat
strncmp
strncpy
strpbrk
strrchr
strspn
strstr
strtok

Not implemented:

strerror
strcoll
strtok_r
strxfrm

<strings.h>, functions pertinent to string manipulation:

index
rindex
strcasecmp
strncasecmp


Additionally, the following functions are implemented:

strndup - duplicate no more than n characters of string
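
The one-line description above can be sketched as follows. This is an
illustration under stated assumptions, not the library's code: the 'Uchar'
typedef, the function name sketch_u_strndup, and its exact signature are
all hypothetical, and the real function's signature should be taken from
the library reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: one 16-bit code unit; the library's actual typedef may differ. */
typedef uint16_t Uchar;

/* Sketch: returns a newly allocated, zero-terminated copy of at most n
   characters of s, or NULL if allocation fails. The caller frees the
   result with free(). */
static Uchar *sketch_u_strndup(const Uchar *s, size_t n)
{
    size_t len = 0;
    Uchar *copy;

    assert(s != NULL);

    /* Stop at n characters or at the terminator, whichever comes first. */
    while (len < n && s[len] != 0)
        ++len;

    copy = malloc((len + 1) * sizeof *copy);
    if (copy == NULL)
        return NULL;

    for (size_t i = 0; i < len; ++i)
        copy[i] = s[i];
    copy[len] = 0;  /* always terminate, even when the source was cut short */
    return copy;
}
```

Unlike strncpy, this always produces a terminated string, so the copy can
be passed directly to the other string functions.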


Charsets handling
=================

Functions at this level allow conversion from an external encoding
to UTF-16 and from UTF-16 to an external encoding.

Libunicode will support a number of de jure/de facto encodings.
Currently supported encodings are:

UTF-8 (ISO/IEC 10646)
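
The conversion functions themselves are not named in this overview, so the
following is only a sketch of what a UTF-8 to UTF-16 step involves. It
assumes a 16-bit 'Uchar' and handles only characters that fit in a single
non-surrogate code unit (1- to 3-byte UTF-8 sequences). The function name,
signature, and error convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one 16-bit code unit; the library's actual typedef may differ. */
typedef uint16_t Uchar;

/* Sketch: decodes the zero-terminated UTF-8 string src into out, which has
   room for out_cap code units including the terminator. Returns the number
   of code units written (terminator excluded), or (size_t)-1 on malformed
   input, on a character outside the single-unit range, or if out is too
   small. */
static size_t sketch_utf8_to_uchar(const char *src, Uchar *out, size_t out_cap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && out != NULL);
    if (out_cap == 0)
        return (size_t)-1;

    while (*p != 0) {
        uint32_t cp;
        size_t extra;  /* number of continuation bytes expected */

        if (*p < 0x80) {
            cp = *p;
            extra = 0;
        } else if ((*p & 0xE0) == 0xC0) {
            cp = *p & 0x1F;
            extra = 1;
        } else if ((*p & 0xF0) == 0xE0) {
            cp = *p & 0x0F;
            extra = 2;
        } else {
            return (size_t)-1;  /* stray continuation byte or 4-byte form */
        }
        ++p;

        for (size_t i = 0; i < extra; ++i, ++p) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;  /* also catches truncated input */
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
        }

        /* Reject overlong encodings and encoded surrogates. */
        if ((extra == 1 && cp < 0x80) || (extra == 2 && cp < 0x800) ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (n + 1 >= out_cap)
            return (size_t)-1;  /* keep one slot for the terminator */
        out[n++] = (Uchar)cp;
    }

    out[n] = 0;
    return n;
}
```

For example, the bytes 0xD0 0x96 decode to the single code unit 0x0416, and
the bytes 0xE2 0x82 0xAC decode to 0x20AC.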



Matthew Parry
<mettw@bowerbird.com.au>
Paul Sokolovsky
<Paul.Sokolovsky@technologist.com>