glibc/2.7/manual/charset.texi

    1: @node Character Set Handling, Locales, String and Array Utilities, Top
    2: @c %MENU% Support for extended character sets
    3: @chapter Character Set Handling
    4: 
    5: @ifnottex
    6: @macro cal{text}
    7: \text\
    8: @end macro
    9: @end ifnottex
   10: 
   11: Character sets used in the early days of computing had only six, seven,
   12: or eight bits for each character: there was never a case where more than
   13: eight bits (one byte) were used to represent a single character.  The
   14: limitations of this approach became more apparent as more people
   15: grappled with non-Roman character sets, where not all the characters
   16: that make up a language's character set can be represented by @math{2^8}
   17: choices.  This chapter shows the functionality that was added to the C
   18: library to support multiple character sets.
   19: 
   20: @menu
   21: * Extended Char Intro::              Introduction to Extended Characters.
   22: * Charset Function Overview::        Overview about Character Handling
   23:                                       Functions.
   24: * Restartable multibyte conversion:: Restartable multibyte conversion
   25:                                       Functions.
   26: * Non-reentrant Conversion::         Non-reentrant Conversion Function.
   27: * Generic Charset Conversion::       Generic Charset Conversion.
   28: @end menu
   29: 
   30: 
   31: @node Extended Char Intro
   32: @section Introduction to Extended Characters
   33: 
   34: A variety of solutions is available to overcome the differences between
   35: character sets with a 1:1 relation between bytes and characters and
   36: character sets with ratios of 2:1 or 4:1.  The remainder of this
   37: section gives a few examples to help understand the design decisions
   38: made while developing the functionality of the @w{C library}.
   39: 
   40: @cindex internal representation
   41: A distinction we have to make right away is between internal and
   42: external representation.  @dfn{Internal representation} means the
   43: representation used by a program while keeping the text in memory.
   44: External representations are used when text is stored or transmitted
   45: through some communication channel.  Examples of external
   46: representations include files waiting in a directory to be
   47: read and parsed.
   48: 
   49: Traditionally there has been no difference between the two representations.
   50: It was equally comfortable and useful to use the same single-byte
   51: representation internally and externally.  This comfort level decreases
   52: with more and larger character sets.
   53: 
   54: One of the problems to overcome with the internal representation is
   55: handling text that is externally encoded using different character
   56: sets.  Assume a program that reads two texts and compares them using
   57: some metric.  The comparison can be usefully done only if the texts are
   58: internally kept in a common format.
   59: 
   60: @cindex wide character
   61: For such a common format (@math{=} character set) eight bits are certainly
   62: no longer enough.  So the smallest entity will have to grow: @dfn{wide
   63: characters} will now be used.  Instead of one byte per character, two or
   64: four will be used.  (Three are not good to address in memory and
   65: more than four bytes seem not to be necessary.)
   66: 
   67: @cindex Unicode
   68: @cindex ISO 10646
   69: As shown in some other part of this manual,
   70: @c !!! Ahem, wide char string functions are not yet covered -- drepper
   71: a completely new family of functions has been created that can handle wide
   72: character texts in memory.  The most commonly used character sets for such
   73: internal wide character representations are Unicode and @w{ISO 10646}
   74: (also known as UCS for Universal Character Set).  Unicode was originally
   75: planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
   76: be a 31-bit code space.  The two standards are practically identical.
   77: They have the same character repertoire and code table, but Unicode specifies
   78: added semantics.  At the moment, only characters in the first @code{0x10000}
   79: code positions (the so-called Basic Multilingual Plane, BMP) have been
   80: assigned, but the assignment of more specialized characters outside this
   81: 16-bit space is already in progress.  A number of encodings have been
   82: defined for Unicode and @w{ISO 10646} characters:
   83: @cindex UCS-2
   84: @cindex UCS-4
   85: @cindex UTF-8
   86: @cindex UTF-16
   87: UCS-2 is a 16-bit word that can only represent characters
   88: from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
   89: and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
   90: ASCII characters are represented by ASCII bytes and non-ASCII characters
   91: by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
   92: of UCS-2 in which pairs of certain UCS-2 words can be used to encode
   93: non-BMP characters up to @code{0x10ffff}.
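
The UTF-8 byte layout described above can be sketched in a few lines of C.
This is only an illustration: the helper name @code{ucs4_to_utf8} is
hypothetical and not part of the C library, it follows the original
31-bit definition (one to six bytes), and it makes no attempt to reject
surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the code point CP as UTF-8 into OUT.
   Returns the number of bytes written (1 to 6), or 0 if CP does not
   fit into 31 bits.  */
size_t
ucs4_to_utf8 (uint32_t cp, unsigned char out[6])
{
  size_t len;
  size_t i;

  if (cp < 0x80)
    {
      /* ASCII characters are represented by themselves.  */
      out[0] = (unsigned char) cp;
      return 1;
    }
  else if (cp < 0x800)
    len = 2;
  else if (cp < 0x10000)
    len = 3;
  else if (cp < 0x200000)
    len = 4;
  else if (cp < 0x4000000)
    len = 5;
  else if (cp < 0x80000000u)
    len = 6;
  else
    return 0;

  /* Continuation bytes have the form 10xxxxxx and are filled in from
     the end, six payload bits at a time.  */
  for (i = len - 1; i > 0; --i)
    {
      out[i] = (unsigned char) (0x80 | (cp & 0x3f));
      cp >>= 6;
    }

  /* The lead byte starts with LEN one bits followed by a zero bit;
     the remaining payload bits go into its low end.  */
  out[0] = (unsigned char) (((0xff00u >> len) & 0xff) | cp);
  return len;
}
```

Every byte of a multi-byte sequence has its high bit set, which is why
such sequences never collide with ASCII bytes.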
   94: 
   95: To represent wide characters the @code{char} type is not suitable.  For
   96: this reason the @w{ISO C} standard introduces a new type that is
   97: designed to keep one character of a wide character string.  To maintain
   98: the similarity there is also a type corresponding to @code{int} for
   99: those functions that take a single wide character.
  100: 
  101: @comment stddef.h
  102: @comment ISO
  103: @deftp {Data type} wchar_t
  104: This data type is used as the base type for wide character strings.
  105: In other words, arrays of objects of this type are the equivalent of
  106: @code{char[]} for multibyte character strings.  The type is defined in
  107: @file{stddef.h}.
  108: 
  109: The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
  110: say anything specific about the representation.  It only requires that
  111: this type is capable of storing all elements of the basic character set.
  112: Therefore it would be legitimate to define @code{wchar_t} as @code{char},
  113: which might make sense for embedded systems.
  114: 
  115: But for GNU systems @code{wchar_t} is always 32 bits wide and is therefore
  116: capable of representing all UCS-4 values, thus covering all of
  117: @w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16-bit type
  118: and thereby follow Unicode very strictly.  This definition is perfectly
  119: fine with the standard, but it also means that to represent all
  120: characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate
  121: characters, which is in fact a multi-wide-character encoding.  But
  122: resorting to multi-wide-character encoding contradicts the purpose of the
  123: @code{wchar_t} type.
  124: @end deftp
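
To illustrate why a 16-bit @code{wchar_t} leads to a
multi-wide-character encoding, the following sketch splits a code point
into UTF-16 units.  The helper name @code{ucs4_to_utf16} is hypothetical
and not part of the C library; the sketch assumes the usual UTF-16
surrogate ranges starting at @code{0xd800} and @code{0xdc00}.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the code point CP as UTF-16 into OUT.
   Returns the number of 16-bit units written (1 or 2), or 0 if CP
   cannot be represented in UTF-16.  */
size_t
ucs4_to_utf16 (uint32_t cp, uint16_t out[2])
{
  if (cp < 0x10000)
    {
      /* The surrogate range itself does not encode characters.  */
      if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;
      out[0] = (uint16_t) cp;
      return 1;
    }
  if (cp > 0x10ffff)
    return 0;

  /* Non-BMP characters: distribute the 20 bits of the offset over a
     high and a low surrogate, 10 bits each.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xd800 | (cp >> 10));
  out[1] = (uint16_t) (0xdc00 | (cp & 0x3ff));
  return 2;
}
```

A program holding such text in a 16-bit @code{wchar_t} array can no
longer assume that one array element is one character.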
  125: 
  126: @comment wchar.h
  127: @comment ISO
  128: @deftp {Data type} wint_t
  129: @code{wint_t} is a data type used for parameters and variables that
  130: contain a single wide character.  As the name suggests this type is the
  131: equivalent of @code{int} when using the normal @code{char} strings.  The
  132: types @code{wchar_t} and @code{wint_t} often have the same
  133: representation if their size is 32 bits wide but if @code{wchar_t} is
  134: defined as @code{char} the type @code{wint_t} must be defined as
  135: @code{int} due to the parameter promotion.
  136: 
  137: @pindex wchar.h
  138: This type is defined in @file{wchar.h} and was introduced in
  139: @w{Amendment 1} to @w{ISO C90}.
  140: @end deftp
  141: 
  142: As for the @code{char} data type, macros are available for
  143: specifying the minimum and maximum value representable in an object of
  144: type @code{wchar_t}.
  145: 
  146: @comment wchar.h
  147: @comment ISO
  148: @deftypevr Macro wint_t WCHAR_MIN
  149: The macro @code{WCHAR_MIN} evaluates to the minimum value representable
  150: by an object of type @code{wchar_t}.
  151: 
  152: This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
  153: @end deftypevr
  154: 
  155: @comment wchar.h
  156: @comment ISO
  157: @deftypevr Macro wint_t WCHAR_MAX
  158: The macro @code{WCHAR_MAX} evaluates to the maximum value representable
  159: by an object of type @code{wchar_t}.
  160: 
  161: This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
  162: @end deftypevr
  163: 
  164: Another special wide character value is the equivalent of @code{EOF}.
  165: 
  166: @comment wchar.h
  167: @comment ISO
  168: @deftypevr Macro wint_t WEOF
  169: The macro @code{WEOF} evaluates to a constant expression of type
  170: @code{wint_t} whose value is different from any member of the extended
  171: character set.
  172: 
  173: @code{WEOF} need not be the same value as @code{EOF} and unlike
  174: @code{EOF} it also need @emph{not} be negative.  In other words, sloppy
  175: code like
  176: 
  177: @smallexample
  178: @{
  179:   int c;
  180:   @dots{}
  181:   while ((c = getc (fp)) >= 0)
  182:     @dots{}
  183: @}
  184: @end smallexample
  185: 
  186: @noindent
  187: has to be rewritten to use @code{WEOF} explicitly when wide characters
  188: are used:
  189: 
  190: @smallexample
  191: @{
  192:   wint_t c;
  193:   @dots{}
  194:   while ((c = getwc (fp)) != WEOF)
  195:     @dots{}
  196: @}
  197: @end smallexample
  198: 
  199: @pindex wchar.h
  200: This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
  201: defined in @file{wchar.h}.
  202: @end deftypevr
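
A complete function built around the explicit @code{WEOF} comparison
might look like the following sketch.  The name
@code{count_wide_chars} is hypothetical; note that @code{getwc} also
returns @code{WEOF} on an encoding error, which this sketch does not
distinguish from the end of the file.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: count the wide characters that can be read
   from FP until getwc reports WEOF.  */
size_t
count_wide_chars (FILE *fp)
{
  size_t n = 0;
  wint_t c;   /* wint_t, not wchar_t, so that WEOF can be represented.  */

  while ((c = getwc (fp)) != WEOF)
    ++n;
  return n;
}
```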
  203: 
  204: 
  205: These internal representations present problems when it comes to storage
  206: and transmittal.  Because each single wide character consists of more
  207: than one byte, they are affected by byte-ordering.  Thus, machines with
  208: different endiannesses would see different values when accessing the same
  209: data.  This byte ordering concern also applies to communication protocols
  210: that are all byte-based and therefore require the sender to
  211: decide how to split the wide character into bytes.  A last (but not least
  212: important) point is that wide characters often require more storage space
  213: than a customized byte-oriented character set.
  214: 
  215: @cindex multibyte character
  216: @cindex EBCDIC
  217: For all the above reasons, an external encoding that is different from
  218: the internal encoding is often used if the latter is UCS-2 or UCS-4.
  219: The external encoding is byte-based and can be chosen appropriately for
  220: the environment and for the texts to be handled.  A variety of different
  221: character sets can be used for this external encoding (information that
  222: will not be exhaustively presented here--instead, a description of the
  223: major groups will suffice).  All of the ASCII-based character sets
  224: fulfill one requirement: they are ``filesystem safe.''  This means that
  225: the character @code{'/'} is used in the encoding @emph{only} to
  226: represent itself.  Things are a bit different for character sets like
  227: EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
  228: family used by IBM), but if the operating system does not understand
  229: EBCDIC directly the parameters to system calls have to be converted
  230: first anyhow.
  231: 
  232: @itemize @bullet
  233: @item
  234: The simplest character sets are single-byte character sets.  There can
  235: be only up to 256 characters (for @w{8 bit} character sets), which is
  236: not sufficient to cover all languages but might be sufficient to handle
  237: a specific text.  Handling of @w{8 bit} character sets is simple.  This
  238: is not true for other kinds presented later, and therefore, the
  239: application one uses might require the use of @w{8 bit} character sets.
  240: 
  241: @cindex ISO 2022
  242: @item
  243: The @w{ISO 2022} standard defines a mechanism for extended character
  244: sets where one character @emph{can} be represented by more than one
  245: byte.  This is achieved by associating a state with the text.
  246: Characters that can be used to change the state can be embedded in the
  247: text.  Each byte in the text might have a different interpretation in each
  248: state.  The state might even influence whether a given byte stands for a
  249: character on its own or whether it has to be combined with some more
  250: bytes.
  251: 
  252: @cindex EUC
  253: @cindex Shift_JIS
  254: @cindex SJIS
  255: In most uses of @w{ISO 2022} the defined character sets do not allow
  256: state changes that cover more than the next character.  This has the
  257: big advantage that whenever one can identify the beginning of the byte
  258: sequence of a character one can interpret a text correctly.  Examples of
  259: character sets using this policy are the various EUC character sets
  260: (used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
  261: or Shift_JIS (SJIS, a Japanese encoding).
  262: 
  263: But there are also character sets using a state that is valid for more
  264: than one character and has to be changed by another byte sequence.
  265: Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
  266: 
  267: @item
  268: @cindex ISO 6937
  269: Early attempts to fix 8 bit character sets for other languages using the
  270: Roman alphabet led to character sets like @w{ISO 6937}.  Here bytes
  271: representing characters like the acute accent do not produce output
  272: themselves: one has to combine them with other characters to get the
  273: desired result.  For example, one writes the byte sequence @code{0xc2 0x61}
  274: (non-spacing acute accent, followed by lower-case `a') to get the ``small
  275: a with acute'' character.  To get the acute accent character on its own,
  276: one has to write @code{0xc2 0x20} (the non-spacing acute followed by a
  277: space).
  278: 
  279: Character sets like @w{ISO 6937} are used in some embedded systems such
  280: as teletex.
  281: 
  282: @item
  283: @cindex UTF-8
  284: Instead of converting the Unicode or @w{ISO 10646} text used internally,
  285: it is often also sufficient to simply use an encoding different from
  286: UCS-2/UCS-4.  The Unicode and @w{ISO 10646} standards even specify such an
  287: encoding: UTF-8.  This encoding is able to represent all of @w{ISO
  288: 10646} 31 bits in a byte string of length one to six.
  289: 
  290: @cindex UTF-7
  291: There were a few other attempts to encode @w{ISO 10646} such as UTF-7,
  292: but UTF-8 is today the only encoding that should be used.  In fact, with
  293: any luck UTF-8 will soon be the only external encoding that has to be
  294: supported.  It proves to be universally usable and its only disadvantage
  295: is that it favors Roman languages by making the byte string
  296: representation of other scripts (Cyrillic, Greek, Asian scripts) longer
  297: than necessary if using a specific character set for these scripts.
  298: Methods like the Unicode compression scheme can alleviate these
  299: problems.
  300: @end itemize
  301: 
  302: The question remaining is how to select the character set or encoding
  303: to use.  The answer: you cannot decide about it yourself; it is decided
  304: by the developers of the system or the majority of the users.  Since the
  305: goal is interoperability one has to use whatever the other people one
  306: works with use.  If there are no constraints, the selection is based on
  307: the requirements the expected circle of users will have.  In other words,
  308: if a project is expected to be used in only, say, Russia it is fine to use
  309: KOI8-R or a similar character set.  But if at the same time people from,
  310: say, Greece are participating one should use a character set that allows
  311: all people to collaborate.
  312: 
  313: The most widely useful solution seems to be: go with the most general
  314: character set, namely @w{ISO 10646}.  Use UTF-8 as the external encoding
  315: and problems about users not being able to use their own language
  316: adequately are a thing of the past.
  317: 
  318: One final comment about the choice of the wide character representation
  319: is necessary at this point.  We have said above that the natural choice
  320: is using Unicode or @w{ISO 10646}.  This is not required, but at least
  321: encouraged, by the @w{ISO C} standard.  The standard defines at least a
  322: macro @code{__STDC_ISO_10646__} that is only defined on systems where
  323: the @code{wchar_t} type encodes @w{ISO 10646} characters.  If this
  324: symbol is not defined one should avoid making assumptions about the wide
  325: character representation.  If the programmer uses only the functions
  326: provided by the C library to handle wide character strings there should
  327: be no compatibility problems with other systems.
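
A program that wants to rely on the @w{ISO 10646} interpretation of
@code{wchar_t} values can test for the macro at compile time.  The
following is a minimal sketch; the function name
@code{wchar_is_iso10646} is hypothetical.

```c
#include <wchar.h>

/* Hypothetical helper: return 1 if the implementation promises that
   wchar_t values are ISO 10646 code points, 0 otherwise.  */
int
wchar_is_iso10646 (void)
{
#ifdef __STDC_ISO_10646__
  return 1;
#else
  /* No promise is made: treat wchar_t values as opaque and use only
     the library functions to interpret them.  */
  return 0;
#endif
}
```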
  328: 
  329: @node Charset Function Overview
  330: @section Overview about Character Handling Functions
  331: 
  332: A Unix @w{C library} contains three different sets of functions in two
  333: families to handle character set conversion.  One of the function families
  334: (the most commonly used) is specified in the @w{ISO C90} standard and,
  335: therefore, is portable even beyond the Unix world.  Unfortunately this
  336: family is the least useful one.  These functions should be avoided
  337: whenever possible, especially when developing libraries (as opposed to
  338: applications).
  339: 
  340: The second family of functions was introduced in the early Unix standards
  341: (XPG2) and is still part of the latest and greatest Unix standard:
  342: @w{Unix 98}.  It is also the most powerful and useful set of functions.
  343: But we will start with the functions defined in @w{Amendment 1} to
  344: @w{ISO C90}.
  345: 
  346: @node Restartable multibyte conversion
  347: @section Restartable Multibyte Conversion Functions
  348: 
  349: The @w{ISO C} standard defines functions to convert strings from a
  350: multibyte representation to wide character strings.  There are a number
  351: of peculiarities:
  352: 
  353: @itemize @bullet
  354: @item
  355: The character set assumed for the multibyte encoding is not specified
  356: as an argument to the functions.  Instead the character set specified by
  357: the @code{LC_CTYPE} category of the current locale is used; see
  358: @ref{Locale Categories}.
  359: 
  360: @item
  361: The functions handling more than one character at a time require NUL
  362: terminated strings as the argument (i.e., converting blocks of text
  363: does not work unless one can add a NUL byte at an appropriate place).
  364: The GNU C library contains some extensions to the standard that allow
  365: specifying a size, but basically they also expect terminated strings.
  366: @end itemize
  367: 
  368: Despite these limitations the @w{ISO C} functions can be used in many
  369: contexts.  In graphical user interfaces, for instance, it is not
  370: uncommon to have functions that require text to be displayed in a wide
  371: character string if the text is not simple ASCII.  The text itself might
  372: come from a file with translations and the user should decide about the
  373: current locale, which determines the translation and therefore also the
  374: external encoding used.  In such a situation (and many others) the
  375: functions described here are perfect.  If more freedom while performing
  376: the conversion is necessary take a look at the @code{iconv} functions
  377: (@pxref{Generic Charset Conversion}).
  378: 
  379: @menu
  380: * Selecting the Conversion::     Selecting the conversion and its properties.
  381: * Keeping the state::            Representing the state of the conversion.
  382: * Converting a Character::       Converting Single Characters.
  383: * Converting Strings::           Converting Multibyte and Wide Character
  384:                                   Strings.
  385: * Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
  386: @end menu
  387: 
  388: @node Selecting the Conversion
  389: @subsection Selecting the conversion and its properties
  390: 
  391: We already said above that the currently selected locale for the
  392: @code{LC_CTYPE} category decides about the conversion that is performed
  393: by the functions we are about to describe.  Each locale uses its own
  394: character set (given as an argument to @code{localedef}) and this is the
  395: one assumed as the external multibyte encoding.  The wide character
  396: set is always UCS-4, at least on GNU systems.
  397: 
  398: A characteristic of each multibyte character set is the maximum number
  399: of bytes that can be necessary to represent one character.  This
  400: information is quite important when writing code that uses the
  401: conversion functions (as shown in the examples below).
  402: The @w{ISO C} standard defines two macros that provide this information.
  403: 
  404: 
  405: @comment limits.h
  406: @comment ISO
  407: @deftypevr Macro int MB_LEN_MAX
  408: @code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte
  409: sequence for a single character in any of the supported locales.  It is
  410: a compile-time constant and is defined in @file{limits.h}.
  411: @pindex limits.h
  412: @end deftypevr
  413: 
  414: @comment stdlib.h
  415: @comment ISO
  416: @deftypevr Macro int MB_CUR_MAX
  417: @code{MB_CUR_MAX} expands into a positive integer expression that is the
  418: maximum number of bytes in a multibyte character in the current locale.
  419: The value is never greater than @code{MB_LEN_MAX}.  Unlike
  420: @code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
  421: the GNU C library it is not.
  422: 
  423: @pindex stdlib.h
  424: @code{MB_CUR_MAX} is defined in @file{stdlib.h}.
  425: @end deftypevr
  426: 
  427: Two different macros are necessary since strictly @w{ISO C90} compilers
  428: do not allow variable length array definitions, but still it is desirable
  429: to avoid dynamic allocation.  This incomplete piece of code shows the
  430: problem:
  431: 
  432: @smallexample
  433: @{
  434:   char buf[MB_LEN_MAX];
  435:   ssize_t len = 0;
  436: 
  437:   while (! feof (fp))
  438:     @{
  439:       len += fread (&buf[len], 1, MB_CUR_MAX - len, fp);
  440:       /* @r{@dots{} process} buf */
  441:       len -= used;
  442:     @}
  443: @}
  444: @end smallexample
  445: 
  446: The code in the inner loop is expected always to have enough bytes in
  447: the array @var{buf} to convert one multibyte character.  The array
  448: @var{buf} has to be sized statically since many compilers do not allow a
  449: variable size.  The @code{fread} call makes sure that @code{MB_CUR_MAX}
  450: bytes are always available in @var{buf}.  Note that it isn't
  451: a problem if @code{MB_CUR_MAX} is not a compile-time constant.
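
A more complete version of this buffering scheme could look like the
following sketch.  It is an illustration under stated assumptions, not
the definitive way to do it: the name @code{count_mb_chars} is
hypothetical, the function relies on @code{mbrtowc}, @code{mbstate_t}
and @code{mbsinit} (all declared in @file{wchar.h}), and it assumes that
an encoded NUL character occupies exactly one byte.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in FP according
   to the current LC_CTYPE locale.  Returns (size_t) -1 if the input
   contains an invalid or truncated sequence.  */
size_t
count_mb_chars (FILE *fp)
{
  char buf[MB_LEN_MAX];   /* Compile-time bound, usable as array size.  */
  size_t len = 0;         /* Number of bytes currently held in buf.  */
  size_t count = 0;
  mbstate_t state;

  memset (&state, '\0', sizeof (state));
  for (;;)
    {
      size_t used;

      /* Top up the buffer to MB_CUR_MAX bytes, the run-time bound.  */
      if (len < MB_CUR_MAX)
        len += fread (buf + len, 1, MB_CUR_MAX - len, fp);
      if (len == 0)
        break;            /* End of input or read error.  */

      used = mbrtowc (NULL, buf, len, &state);
      if (used == (size_t) -1)
        return (size_t) -1;
      if (used == (size_t) -2)
        {
          /* All bytes were absorbed into STATE without completing a
             character; discard them and read more.  */
          len = 0;
          continue;
        }
      if (used == 0)
        used = 1;         /* Assumption: an encoded NUL is one byte.  */

      /* Move the unconsumed bytes to the front of the buffer.  */
      memmove (buf, buf + used, len - used);
      len -= used;
      ++count;
    }

  /* A state other than the initial one means a truncated character.  */
  return mbsinit (&state) ? count : (size_t) -1;
}
```

The array is sized with @code{MB_LEN_MAX} because that is a compile-time
constant, while the amount of data requested from @code{fread} is
limited by @code{MB_CUR_MAX}, which depends on the locale in effect.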
  452: 
  453: 
  454: @node Keeping the state
  455: @subsection Representing the state of the conversion
  456: 
  457: @cindex stateful
  458: In the introduction of this chapter it was said that certain character
  459: sets use a @dfn{stateful} encoding.  That is, the encoded values depend
  460: in some way on the previous bytes in the text.
  461: 
  462: Since the conversion functions allow converting a text in more than one
  463: step we must have a way to pass this information from one call of the
  464: functions to another.
  465: 
  466: @comment wchar.h
  467: @comment ISO
  468: @deftp {Data type} mbstate_t
  469: @cindex shift state
  470: A variable of type @code{mbstate_t} can contain all the information
  471: about the @dfn{shift state} needed from one call to a conversion
  472: function to another.
  473: 
  474: @pindex wchar.h
  475: @code{mbstate_t} is defined in @file{wchar.h}.  It was introduced in
  476: @w{Amendment 1} to @w{ISO C90}.
  477: @end deftp
  478: 
  479: To use objects of type @code{mbstate_t} the programmer has to define such
  480: objects (normally as local variables on the stack) and pass a pointer to
  481: the object to the conversion functions.  This way the conversion function
  482: can update the object if the current multibyte character set is stateful.
  483: 
  484: There is no specific function or initializer to put the state object in
  485: any specific state.  The rules are that the object should always
  486: represent the initial state before the first use, and this is achieved by
  487: clearing the whole variable with code such as the following:
  488: 
  489: @smallexample
  490: @{
  491:   mbstate_t state;
  492:   memset (&state, '\0', sizeof (state));
  493:   /* @r{from now on @var{state} can be used.}  */
  494:   @dots{}
  495: @}
  496: @end smallexample
  497: 
  498: When using the conversion functions to generate output it is often
  499: necessary to test whether the current state corresponds to the initial
  500: state.  This is necessary, for example, to decide whether to emit
  501: escape sequences to set the state to the initial state at certain
  502: sequence points.  Communication protocols often require this.
  503: 
  504: @comment wchar.h
  505: @comment ISO
  506: @deftypefun int mbsinit (const mbstate_t *@var{ps})
  507: The @code{mbsinit} function determines whether the state object pointed
  508: to by @var{ps} is in the initial state.  If @var{ps} is a null pointer or
  509: the object is in the initial state the return value is nonzero.  Otherwise
  510: it is zero.
  511: 
  512: @pindex wchar.h
  513: @code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
  514: declared in @file{wchar.h}.
  515: @end deftypefun
  516: 
  517: Code using @code{mbsinit} often looks similar to this:
  518: 
  519: @c Fix the example to explicitly say how to generate the escape sequence
  520: @c to restore the initial state.
  521: @smallexample
  522: @{
  523:   mbstate_t state;
  524:   memset (&state, '\0', sizeof (state));
  525:   /* @r{Use @var{state}.}  */
  526:   @dots{}
  527:   if (! mbsinit (&state))
  528:     @{
  529:       /* @r{Emit code to return to initial state.}  */
  530:       const wchar_t empty[] = L"";
  531:       const wchar_t *srcp = empty;
  532:       wcsrtombs (outbuf, &srcp, outbuflen, &state);
  533:     @}
  534:   @dots{}
  535: @}
  536: @end smallexample
  537: 
  538: The code to emit the escape sequence to get back to the initial state is
  539: interesting.  The @code{wcsrtombs} function can be used to determine the
  540: necessary output code (@pxref{Converting Strings}).  Please note that on
  541: GNU systems it is not necessary to perform this extra action for the
  542: conversion from multibyte text to wide character text since the wide
  543: character encoding is not stateful.  But there is nothing mentioned in
  544: any standard that prohibits @code{wchar_t} from using a stateful
  545: encoding.
  546: 
  547: @node Converting a Character
  548: @subsection Converting Single Characters
  549: 
  550: The most fundamental of the conversion functions are those dealing with
  551: single characters.  Please note that this does not always mean single
  552: bytes.  But since there is very often a subset of the multibyte
  553: character set that consists of single byte sequences, there are
  554: functions to help with converting bytes.  Frequently, ASCII is a subpart
  555: of the multibyte character set.  In such a scenario, each ASCII character
  556: stands for itself, and all other characters have at least a first byte
  557: that is beyond the range @math{0} to @math{127}.
  558: 
  559: @comment wchar.h
  560: @comment ISO
  561: @deftypefun wint_t btowc (int @var{c})
  562: The @code{btowc} function (``byte to wide character'') converts a valid
  563: single byte character @var{c} in the initial shift state into the wide
  564: character equivalent using the conversion rules from the currently
  565: selected locale of the @code{LC_CTYPE} category.
  566: 
  567: If @code{(unsigned char) @var{c}} is not a valid single byte multibyte
  568: character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.
  569: 
  570: Please note the restriction of @var{c} being tested for validity only in
  571: the initial shift state.  No @code{mbstate_t} object is used from
  572: which the state information is taken, and the function also does not use
  573: any static state.
  574: 
  575: @pindex wchar.h
  576: The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90}
  577: and is declared in @file{wchar.h}.
  578: @end deftypefun
  579: 
  580: Despite the limitation that the single byte value is always interpreted
  581: in the initial state this function is actually useful most of the time.
  582: Most character sets are either entirely single-byte character sets or they
  583: are extensions to ASCII.  It is then possible to write code like this
  584: (not that this specific example is very useful):
  585: 
  586: @smallexample
  587: wchar_t *
  588: itow (unsigned long int val)
  589: @{
  590:   static wchar_t buf[30];
  591:   wchar_t *wcp = &buf[29];
  592:   *wcp = L'\0';
  593:   while (val != 0)
  594:     @{
  595:       *--wcp = btowc ('0' + val % 10);
  596:       val /= 10;
  597:     @}
  598:   if (wcp == &buf[29])
  599:     *--wcp = L'0';
  600:   return wcp;
  601: @}
  602: @end smallexample
  603: 
  604: Why is it necessary to use such a complicated implementation and not
  605: simply cast @code{'0' + val % 10} to a wide character?  The answer is
  606: that there is no guarantee that one can perform this kind of arithmetic
  607: on the characters of the character set used for @code{wchar_t}
  608: representation.  In other situations the bytes are not constant at
  609: compile time and so the compiler cannot do the work.  In situations like
  610: this it is necessary to use @code{btowc}.
  611: 
  612: @noindent
  613: There also is a function for the conversion in the other direction.
  614: 
  615: @comment wchar.h
  616: @comment ISO
  617: @deftypefun int wctob (wint_t @var{c})
  618: The @code{wctob} function (``wide character to byte'') takes as the
  619: parameter a valid wide character.  If the multibyte representation for
  620: this character in the initial state is exactly one byte long, the return
  621: value of this function is this character.  Otherwise the return value is
  622: @code{EOF}.
  623: 
  624: @pindex wchar.h
  625: @code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and
  626: is declared in @file{wchar.h}.
  627: @end deftypefun
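
The two single-byte functions can be combined to check whether a byte is
a complete character on its own.  The following is a minimal sketch;
the name @code{single_byte_roundtrip} is hypothetical, and the result
depends on the locale selected for the @code{LC_CTYPE} category.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: return nonzero if the byte C is a valid
   single-byte character in the initial shift state and converts back
   to the same byte.  */
int
single_byte_roundtrip (unsigned char c)
{
  wint_t wc = btowc (c);

  /* btowc reports WEOF for bytes that are not complete characters;
     wctob reports EOF for wide characters that need more than one
     byte.  */
  return wc != WEOF && wctob (wc) == (int) c;
}
```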
  628: 
  629: There are more general functions to convert single characters from
  630: multibyte representation to wide characters and vice versa.  These
  631: functions pose no limit on the length of the multibyte representation
  632: and they also do not require it to be in the initial state.
  633: 
  634: @comment wchar.h
  635: @comment ISO
  636: @deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
  637: @cindex stateful
  638: The @code{mbrtowc} function (``multibyte restartable to wide
  639: character'') converts the next multibyte character in the string pointed
  640: to by @var{s} into a wide character and stores it in the wide character
  641: string pointed to by @var{pwc}.  The conversion is performed according
  642: to the locale currently selected for the @code{LC_CTYPE} category.  If
  643: the conversion for the character set used in the locale requires a state,
  644: the multibyte string is interpreted in the state represented by the
  645: object pointed to by @var{ps}.  If @var{ps} is a null pointer, a static,
  646: internal state variable used only by the @code{mbrtowc} function is
  647: used.
  648: 
  649: If the next multibyte character corresponds to the NUL wide character,
  650: the return value of the function is @math{0} and the state object is
  651: afterwards in the initial state.  If the next @var{n} or fewer bytes
  652: form a correct multibyte character, the return value is the number of
  653: bytes starting from @var{s} that form the multibyte character.  The
  654: conversion state is updated according to the bytes consumed in the
  655: conversion.  In both cases the wide character (either the @code{L'\0'}
  656: or the one found in the conversion) is stored in the string pointed to
  657: by @var{pwc} if @var{pwc} is not null.
  658: 
  659: If the first @var{n} bytes of the multibyte string possibly form a valid
  660: multibyte character but there are more than @var{n} bytes needed to
  661: complete it, the return value of the function is @code{(size_t) -2} and
  662: no value is stored.  Please note that this can happen even if @var{n}
  663: has a value greater than or equal to @code{MB_CUR_MAX} since the input
  664: might contain redundant shift sequences.
  665: 
  666: If the first @var{n} bytes of the multibyte string cannot possibly form
  667: a valid multibyte character, no value is stored, the global variable
  668: @code{errno} is set to the value @code{EILSEQ}, and the function returns
  669: @code{(size_t) -1}.  The conversion state is afterwards undefined.
  670: 
  671: @pindex wchar.h
  672: @code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
  673: is declared in @file{wchar.h}.
  674: @end deftypefun
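
The four kinds of return value described above can be told apart as in
the following sketch.  The enumeration @code{mb_step_result} and the
function @code{mb_step} are hypothetical names introduced only for this
illustration.  The caller is assumed to pass a state object that it has
initialized itself.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical result codes for this sketch.  */
enum mb_step_result
{
  MB_STEP_CHAR,        /* One character was converted.  */
  MB_STEP_NUL,         /* The NUL wide character was converted.  */
  MB_STEP_INCOMPLETE,  /* More than *NP bytes are needed.  */
  MB_STEP_INVALID      /* Invalid sequence; errno is EILSEQ.  */
};

/* Convert one multibyte character from *SP, looking at no more than
   *NP bytes.  Advance *SP and reduce *NP only when an ordinary
   character was converted.  */
enum mb_step_result
mb_step (wchar_t *pwc, const char **sp, size_t *np, mbstate_t *ps)
{
  size_t nbytes = mbrtowc (pwc, *sp, *np, ps);

  if (nbytes == (size_t) -1)
    return MB_STEP_INVALID;
  if (nbytes == (size_t) -2)
    return MB_STEP_INCOMPLETE;
  if (nbytes == 0)
    return MB_STEP_NUL;

  *sp += nbytes;
  *np -= nbytes;
  return MB_STEP_CHAR;
}
```

The two error values are tested before any arithmetic is done with
@var{nbytes}, because @code{(size_t) -1} and @code{(size_t) -2} are very
large unsigned values and not negative numbers.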
  675: 
  676: Use of @code{mbrtowc} is straightforward.  A function that copies a
  677: multibyte string into a wide character string while at the same time
  678: converting all lowercase characters into uppercase could look like this
  679: (this is not the final version, just an example; it has no error
  680: checking, and sometimes leaks memory):
  681: 
  682: @smallexample
  683: wchar_t *
  684: mbstouwcs (const char *s)
  685: @{
  686:   size_t len = strlen (s) + 1;  /* Include the terminating null byte.  */
  687:   wchar_t *result = malloc (len * sizeof (wchar_t));
  688:   wchar_t *wcp = result;
  689:   wchar_t tmp[1];
  690:   mbstate_t state;
  691:   size_t nbytes;
  692: 
  693:   memset (&state, '\0', sizeof (state));
  694:   while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
  695:     @{
  696:       if (nbytes >= (size_t) -2)
  697:         return NULL;  /* Invalid or incomplete input string.  */
  698:       *wcp++ = towupper (tmp[0]);
  699:       len -= nbytes;
  700:       s += nbytes;
  701:     @}
  702:   *wcp = L'\0';  /* Terminate the result.  */
  703:   return result;
  704: @}
  705: @end smallexample
  706: 
  707: The use of @code{mbrtowc} should be clear.  A single wide character is
  708: stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
  709: in the variable @var{nbytes}.  If the conversion is successful, the
  710: uppercase variant of the wide character is stored in the @var{result}
  711: array, and the pointer to the input string and the number of available
  712: bytes are adjusted.
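
For comparison, a variant that checks the allocation, releases the
buffer when the input is invalid, and terminates the result might look
like the following sketch.  The name @code{mbstouwcs_checked} is
hypothetical.  The sketch passes the remaining length including the
terminating null byte to @code{mbrtowc}, so that the end of the string
is reported by a return value of @math{0}.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical variant of the example with basic error handling.
   Returns a newly allocated wide character string, or NULL if the
   allocation fails or the input is not a valid multibyte string.  */
wchar_t *
mbstouwcs_checked (const char *s)
{
  /* Count the terminating null byte so that mbrtowc can see it.  */
  size_t len = strlen (s) + 1;
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t tmp;
  mbstate_t state;
  size_t nbytes;

  if (result == NULL)
    return NULL;

  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (&tmp, s, len, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        {
          /* Invalid or incomplete input: do not leak the buffer.  */
          free (result);
          return NULL;
        }
      *wcp++ = towupper (tmp);
      len -= nbytes;
      s += nbytes;
    }

  /* mbrtowc returned 0, so the terminating NUL was reached.  */
  *wcp = L'\0';
  return result;
}
```

The caller owns the returned buffer and releases it with @code{free}
when it is no longer needed.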
  713: 
  714: The only non-obvious thing about this example might be the way memory
  715: is allocated for the result.  The code relies on the fact that there
  716: can never be more wide characters in the converted result than there are
  717: bytes in the multibyte input string.  This method yields a pessimistic
  718: guess about the size of the result, and if many wide character strings
  719: have to be constructed this way or if the strings are long, the memory
  720: wasted whenever the input contains characters longer than one byte
  721: might be significant.  The allocated memory block can
  722: be resized to the correct size before returning it, but a better solution
  723: might be to allocate just the right amount of space for the result right
  724: away.  Unfortunately there is no function to compute the length of the wide
  725: character string directly from the multibyte string.  There is, however, a
  726: function that does part of the work.
  727: 
  728: @comment wchar.h
  729: @comment ISO
  730: @deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
  731: The @code{mbrlen} function (``multibyte restartable length'') computes
  732: how many bytes, out of at most @var{n} bytes starting at @var{s}, form
  733: the next valid and complete multibyte character.
  734: