libunicode
==========

Introduction
============

Libunicode offers low-level Unicode (UTF-16) text processing
functionality, which can be divided into three categories:

  o Character handling
  o String handling
  o Charsets handling

Libunicode uses the UTF-16 encoding defined by ISO/IEC 10646 for storing
and manipulating all character entities. It will support other encoding
standards (e.g., UTF-8, ISO 8859-x, etc.) for input and output only.

Where applicable, libunicode takes the "Single Unix Specification,
Version 2(R)" (susv2) as its API and semantics reference. susv2 is the
unification and superset of the de jure POSIX and ANSI C (run-time
library part) standards and the de facto BSD standards. This means that
if you know the standard character and string handling functions, you
can readily use libunicode; and if you have an application that uses the
standard character/string processing facilities, you can make it
Unicode-aware with minimal trouble.

Also, don't let the word "Unix" in the standard's name confuse you.
Susv2, like POSIX, is a standard for *Open* operating systems, a
category into which MS Windows, MacOS, etc. fit. The name was chosen by
the OpenGroup, the maintainer of susv2, to unite and defend market
sectors actively attacked by Microsoft with its "decommoditizing"
tactics. Libunicode is a bright example of the opposite approach,
offering cross-platform portability and compatibility for Unix and Win32
systems. (*)

(*) The opinions presented in the paragraph above are solely those of
    the documentation author and should not be considered as reflecting
    the real state of things.

Libunicode defines a new type, 'Uchar', which can hold any non-surrogate
UTF-16 character without space overhead.

The library offers two APIs: one is a precise remapping of the susv2
functions, and the other is a slightly higher-level API with automatic
memory management fully controlled by the user.

Functions of the 1st API (fully standard-compliant, the one you will
probably use) use the 'u_' prefix, e.g.
standard

    char *strchr(const char *s, int c);

becomes

    Uchar *u_strchr(const Uchar *s, Uchar c);

Functions of the 2nd API use the 'uni_' prefix. They are intended for
use in special environments, for example in Apache web server modules.
Most functions have completely identical 'u_' and 'uni_'
implementations, but the following differ from the standard in argument
structure and semantics:

    uni_strcat   uni_strncat
    uni_strdup   uni_strndup
    uni_strcpy   uni_strncpy

You should consult the library reference for their full description.

Below is a more detailed overview of the three libunicode subsystems:

Character handling
==================

libunicode implements the following functions from susv2, listed by
defining header:

<ctype.h>:
    isalnum isalpha isascii iscntrl isdigit isgraph islower isprint
    ispunct isspace isupper isxdigit toascii tolower toupper _tolower
    _toupper

String handling
===============

libunicode implements the following functions from susv2, listed by
defining header:

<string.h>, functions pertinent to string manipulation:
    strcat strchr strcmp strcpy strcspn strdup strlen strncat strncmp
    strncpy strpbrk strrchr strspn strstr strtok

    not implemented:
    strerror strcoll strtok_r strxfrm

<strings.h>, functions pertinent to string manipulation:
    index rindex strcasecmp strncasecmp

Additionally, the following functions are implemented:
    strndup - duplicate no more than n characters of a string

Charsets handling
=================

Functions of this level allow conversion from an external encoding to
UTF-16 and from UTF-16 to an external encoding. Libunicode will support
a number of de jure/de facto encodings. Currently supported encodings
are:

    UTF-8 (ISO/IEC 10646)


Matthew Parry <mettw@bowerbird.com.au>
Paul Sokolovsky <Paul.Sokolovsky@technologist.com>