
@node Character Set Handling, Locales, String and Array Utilities, Top
@c %MENU% Support for extended character sets
@chapter Character Set Handling

@ifnottex
@macro cal{text}
\text\
@end macro
@end ifnottex

Character sets used in the early days of computing had only six, seven,
or eight bits for each character: there was never a case where more than
eight bits (one byte) were used to represent a single character.  The
limitations of this approach became more apparent as more people
grappled with non-Roman character sets, where not all the characters
that make up a language's character set can be represented by @math{2^8}
choices.  This chapter shows the functionality that was added to the C
library to support multiple character sets.

@menu
* Extended Char Intro::              Introduction to Extended Characters.
* Charset Function Overview::        Overview about Character Handling
                                      Functions.
* Restartable multibyte conversion:: Restartable multibyte conversion
                                      Functions.
* Non-reentrant Conversion::         Non-reentrant Conversion Function.
* Generic Charset Conversion::       Generic Charset Conversion.
@end menu


@node Extended Char Intro
@section Introduction to Extended Characters

A variety of solutions is available to overcome the differences between
character sets with a 1:1 relation between bytes and characters and
character sets with ratios of 2:1 or 4:1.  The remainder of this
section gives a few examples to help understand the design decisions
made while developing the functionality of the @w{C library}.

@cindex internal representation
A distinction we have to make right away is between internal and
external representation.  @dfn{Internal representation} means the
representation used by a program while keeping the text in memory.
External representations are used when text is stored or transmitted
through some communication channel.  Examples of external
representations include files waiting in a directory to be
read and parsed.

Traditionally there has been no difference between the two representations.
It was equally comfortable and useful to use the same single-byte
representation internally and externally.  This comfort level decreases
with more and larger character sets.

One of the problems to overcome with the internal representation is
handling text that is externally encoded using different character
sets.  Assume a program that reads two texts and compares them using
some metric.  The comparison can be usefully done only if the texts are
internally kept in a common format.

@cindex wide character
For such a common format (@math{=} character set) eight bits are certainly
no longer enough.  So the smallest entity will have to grow: @dfn{wide
characters} will now be used.  Instead of one byte per character, two or
four will be used.  (Three are not good to address in memory and
more than four bytes seem not to be necessary.)

@cindex Unicode
@cindex ISO 10646
As shown in some other part of this manual,
@c !!! Ahem, wide char string functions are not yet covered -- drepper
a completely new family of functions has been created that can handle wide
character texts in memory.  The most commonly used character sets for such
internal wide character representations are Unicode and @w{ISO 10646}
(also known as UCS for Universal Character Set).  Unicode was originally
planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
be a 31-bit large code space.  The two standards are practically identical.
They have the same character repertoire and code table, but Unicode specifies
added semantics.
At the moment, only characters in the first @code{0x10000}
code positions (the so-called Basic Multilingual Plane, BMP) have been
assigned, but the assignment of more specialized characters outside this
16-bit space is already in progress.  A number of encodings have been
defined for Unicode and @w{ISO 10646} characters:
@cindex UCS-2
@cindex UCS-4
@cindex UTF-8
@cindex UTF-16
UCS-2 is a 16-bit word that can only represent characters
from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
ASCII characters are represented by ASCII bytes and non-ASCII characters
by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
of UCS-2 in which pairs of certain UCS-2 words can be used to encode
non-BMP characters up to @code{0x10ffff}.

To represent wide characters the @code{char} type is not suitable.  For
this reason the @w{ISO C} standard introduces a new type that is
designed to keep one character of a wide character string.  To maintain
the similarity there is also a type corresponding to @code{int} for
those functions that take a single wide character.

@comment stddef.h
@comment ISO
@deftp {Data type} wchar_t
This data type is used as the base type for wide character strings.
In other words, arrays of objects of this type are the equivalent of
@code{char[]} for multibyte character strings.  The type is defined in
@file{stddef.h}.

The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
say anything specific about the representation.  It only requires that
this type is capable of storing all elements of the basic character set.
Therefore it would be legitimate to define @code{wchar_t} as @code{char},
which might make sense for embedded systems.

But for GNU systems @code{wchar_t} is always 32 bits wide and, therefore,
capable of representing all UCS-4 values and, therefore, covering all of
@w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16-bit type
and thereby follow Unicode very strictly.  This definition is perfectly
fine with the standard, but it also means that to represent all
characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate
characters, which is in fact a multi-wide-character encoding.  But
resorting to multi-wide-character encoding contradicts the purpose of the
@code{wchar_t} type.
@end deftp

@comment wchar.h
@comment ISO
@deftp {Data type} wint_t
@code{wint_t} is a data type used for parameters and variables that
contain a single wide character.  As the name suggests this type is the
equivalent of @code{int} when using the normal @code{char} strings.  The
types @code{wchar_t} and @code{wint_t} often have the same
representation if their size is 32 bits wide, but if @code{wchar_t} is
defined as @code{char} the type @code{wint_t} must be defined as
@code{int} due to the parameter promotion.

@pindex wchar.h
This type is defined in @file{wchar.h} and was introduced in
@w{Amendment 1} to @w{ISO C90}.
@end deftp

As for the @code{char} data type, macros are available for
specifying the minimum and maximum value representable in an object of
type @code{wchar_t}.

@comment wchar.h
@comment ISO
@deftypevr Macro wint_t WCHAR_MIN
The macro @code{WCHAR_MIN} evaluates to the minimum value representable
by an object of type @code{wchar_t}.

This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
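
As a small illustration, and only as a sketch, @code{WCHAR_MIN} can be
combined with its counterpart @code{WCHAR_MAX} to check whether a
numeric value is representable as a @code{wchar_t} before it is stored.
The helper name @code{fits_in_wchar} is made up for this example and is
not part of any library.

@smallexample
#include <wchar.h>

/* Hypothetical helper: return nonzero if VALUE lies within the
   range of wchar_t on this system.  */
int
fits_in_wchar (long long int value)
@{
  return value >= (long long int) WCHAR_MIN
         && value <= (long long int) WCHAR_MAX;
@}
@end smallexample

@noindent
The casts to @code{long long int} avoid surprises on systems where
@code{wchar_t} is an unsigned type.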
@end deftypevr

@comment wchar.h
@comment ISO
@deftypevr Macro wint_t WCHAR_MAX
The macro @code{WCHAR_MAX} evaluates to the maximum value representable
by an object of type @code{wchar_t}.

This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
@end deftypevr

Another special wide character value is the equivalent to @code{EOF}.

@comment wchar.h
@comment ISO
@deftypevr Macro wint_t WEOF
The macro @code{WEOF} evaluates to a constant expression of type
@code{wint_t} whose value is different from any member of the extended
character set.

@code{WEOF} need not be the same value as @code{EOF} and unlike
@code{EOF} it also need @emph{not} be negative.  In other words, sloppy
code like

@smallexample
@{
  int c;
  @dots{}
  while ((c = getc (fp)) >= 0)
    @dots{}
@}
@end smallexample

@noindent
has to be rewritten to use @code{WEOF} explicitly when wide characters
are used:

@smallexample
@{
  wint_t c;
  @dots{}
  while ((c = getwc (fp)) != WEOF)
    @dots{}
@}
@end smallexample

@pindex wchar.h
This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
defined in @file{wchar.h}.
@end deftypevr


These internal representations present problems when it comes to storage
and transmittal.  Because each single wide character consists of more
than one byte, they are affected by byte ordering.  Thus, machines with
different endiannesses would see different values when accessing the same
data.  This byte ordering concern also applies to communication protocols
that are all byte-based and therefore require the sender to decide how
to split the wide character into bytes.
A last (but not least
important) point is that wide characters often require more storage space
than a customized byte-oriented character set.

@cindex multibyte character
@cindex EBCDIC
For all the above reasons, an external encoding that is different from
the internal encoding is often used if the latter is UCS-2 or UCS-4.
The external encoding is byte-based and can be chosen appropriately for
the environment and for the texts to be handled.  A variety of different
character sets can be used for this external encoding (information that
will not be exhaustively presented here; instead, a description of the
major groups will suffice).  All of the ASCII-based character sets
fulfill one requirement: they are ``filesystem safe.''  This means that
the character @code{'/'} is used in the encoding @emph{only} to
represent itself.  Things are a bit different for character sets like
EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
family used by IBM), but if the operating system does not understand
EBCDIC directly the parameters to system calls have to be converted
first anyhow.

@itemize @bullet
@item
The simplest character sets are single-byte character sets.  There can
be only up to 256 characters (for @w{8 bit} character sets), which is
not sufficient to cover all languages but might be sufficient to handle
a specific text.  Handling of @w{8 bit} character sets is simple.  This
is not true for other kinds presented later, and therefore, the
application one uses might require the use of @w{8 bit} character sets.

@cindex ISO 2022
@item
The @w{ISO 2022} standard defines a mechanism for extended character
sets where one character @emph{can} be represented by more than one
byte.  This is achieved by associating a state with the text.
Characters that can be used to change the state can be embedded in the
text.  Each byte in the text might have a different interpretation in each
state.  The state might even influence whether a given byte stands for a
character on its own or whether it has to be combined with some more
bytes.

@cindex EUC
@cindex Shift_JIS
@cindex SJIS
In most uses of @w{ISO 2022} the defined character sets do not allow
state changes that cover more than the next character.  This has the
big advantage that whenever one can identify the beginning of the byte
sequence of a character one can interpret a text correctly.  Examples of
character sets using this policy are the various EUC character sets
(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
or Shift_JIS (SJIS, a Japanese encoding).

But there are also character sets using a state that is valid for more
than one character and has to be changed by another byte sequence.
Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.

@item
@cindex ISO 6937
Early attempts to fix 8 bit character sets for other languages using the
Roman alphabet led to character sets like @w{ISO 6937}.  Here bytes
representing characters like the acute accent do not produce output
themselves: one has to combine them with other characters to get the
desired result.  For example, one has to write the byte sequence
@code{0xc2 0x61} (non-spacing acute accent, followed by lower-case `a')
to get the ``small a with acute'' character.  To get the acute accent
character on its own, one has to write @code{0xc2 0x20} (the non-spacing
acute followed by a space).

Character sets like @w{ISO 6937} are used in some embedded systems such
as teletex.

@item
@cindex UTF-8
Instead of converting the Unicode or @w{ISO 10646} text used internally,
it is often also sufficient to simply use an encoding different from
UCS-2/UCS-4.  The Unicode and @w{ISO 10646} standards even specify such an
encoding: UTF-8.  This encoding is able to represent all of @w{ISO
10646} 31 bits in a byte string of length one to six.

@cindex UTF-7
There were a few other attempts to encode @w{ISO 10646} such as UTF-7,
but UTF-8 is today the only encoding that should be used.  In fact, with
any luck UTF-8 will soon be the only external encoding that has to be
supported.  It proves to be universally usable and its only disadvantage
is that it favors Roman languages by making the byte string
representation of other scripts (Cyrillic, Greek, Asian scripts) longer
than necessary if using a specific character set for these scripts.
Methods like the Unicode compression scheme can alleviate these
problems.
@end itemize

The question remaining is: how to select the character set or encoding
to use.  The answer: you cannot decide about it yourself; it is decided
by the developers of the system or the majority of the users.  Since the
goal is interoperability one has to use whatever the other people one
works with use.  If there are no constraints, the selection is based on
the requirements the expected circle of users will have.  In other words,
if a project is expected to be used in only, say, Russia it is fine to use
KOI8-R or a similar character set.  But if at the same time people from,
say, Greece are participating one should use a character set that allows
all people to collaborate.

The most widely useful solution seems to be: go with the most general
character set, namely @w{ISO 10646}.
Use UTF-8 as the external encoding
and problems about users not being able to use their own language
adequately are a thing of the past.

One final comment about the choice of the wide character representation
is necessary at this point.  We have said above that the natural choice
is using Unicode or @w{ISO 10646}.  This is not required, but at least
encouraged, by the @w{ISO C} standard.  The standard defines at least a
macro @code{__STDC_ISO_10646__} that is only defined on systems where
the @code{wchar_t} type encodes @w{ISO 10646} characters.  If this
symbol is not defined one should avoid making assumptions about the wide
character representation.  If the programmer uses only the functions
provided by the C library to handle wide character strings there should
be no compatibility problems with other systems.

@node Charset Function Overview
@section Overview about Character Handling Functions

A Unix @w{C library} contains three different sets of functions in two
families to handle character set conversion.  One of the function families
(the most commonly used) is specified in the @w{ISO C90} standard and,
therefore, is portable even beyond the Unix world.  Unfortunately this
family is the least useful one.  These functions should be avoided
whenever possible, especially when developing libraries (as opposed to
applications).

The second family of functions was introduced in the early Unix standards
(XPG2) and is still part of the latest and greatest Unix standard:
@w{Unix 98}.  It is also the most powerful and useful set of functions.
But we will start with the functions defined in @w{Amendment 1} to
@w{ISO C90}.

@node Restartable multibyte conversion
@section Restartable Multibyte Conversion Functions

The @w{ISO C} standard defines functions to convert strings from a
multibyte representation to wide character strings.  There are a number
of peculiarities:

@itemize @bullet
@item
The character set assumed for the multibyte encoding is not specified
as an argument to the functions.  Instead the character set specified by
the @code{LC_CTYPE} category of the current locale is used; see
@ref{Locale Categories}.

@item
The functions handling more than one character at a time require
NUL-terminated strings as the argument (i.e., converting blocks of text
does not work unless one can add a NUL byte at an appropriate place).
The GNU C library contains some extensions to the standard that allow
specifying a size, but basically they also expect terminated strings.
@end itemize

Despite these limitations the @w{ISO C} functions can be used in many
contexts.  In graphical user interfaces, for instance, it is not
uncommon to have functions that require text to be displayed in a wide
character string if the text is not simple ASCII.  The text itself might
come from a file with translations and the user should decide about the
current locale, which determines the translation and therefore also the
external encoding used.  In such a situation (and many others) the
functions described here are perfect.  If more freedom while performing
the conversion is necessary take a look at the @code{iconv} functions
(@pxref{Generic Charset Conversion}).

@menu
* Selecting the Conversion::     Selecting the conversion and its properties.
* Keeping the state::            Representing the state of the conversion.
* Converting a Character::       Converting Single Characters.
* Converting Strings::           Converting Multibyte and Wide Character
                                  Strings.
* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
@end menu

@node Selecting the Conversion
@subsection Selecting the conversion and its properties

We already said above that the currently selected locale for the
@code{LC_CTYPE} category decides about the conversion that is performed
by the functions we are about to describe.  Each locale uses its own
character set (given as an argument to @code{localedef}) and this is the
one assumed as the external multibyte encoding.  The wide character
character set always is UCS-4, at least on GNU systems.

A characteristic of each multibyte character set is the maximum number
of bytes that can be necessary to represent one character.  This
information is quite important when writing code that uses the
conversion functions (as shown in the examples below).
The @w{ISO C} standard defines two macros that provide this information.


@comment limits.h
@comment ISO
@deftypevr Macro int MB_LEN_MAX
@code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte
sequence for a single character in any of the supported locales.  It is
a compile-time constant and is defined in @file{limits.h}.
@pindex limits.h
@end deftypevr

@comment stdlib.h
@comment ISO
@deftypevr Macro int MB_CUR_MAX
@code{MB_CUR_MAX} expands into a positive integer expression that is the
maximum number of bytes in a multibyte character in the current locale.
The value is never greater than @code{MB_LEN_MAX}.  Unlike
@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
the GNU C library it is not.

@pindex stdlib.h
@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
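
To see the dependence on the locale in practice, a minimal sketch along
the following lines can be used.  The helper name
@code{mb_cur_max_for} is made up for this illustration, and which
locale names are available depends on the system.

@smallexample
#include <limits.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: select LOCALE_NAME for the LC_CTYPE
   category and return the resulting MB_CUR_MAX, or 0 if the
   locale cannot be selected.  */
size_t
mb_cur_max_for (const char *locale_name)
@{
  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return 0;
  return MB_CUR_MAX;
@}
@end smallexample

@noindent
For any locale that can be selected, the result lies between 1 and
@code{MB_LEN_MAX}, so a buffer of @code{MB_LEN_MAX} bytes is large
enough for one multibyte character in every locale.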
@end deftypevr

Two different macros are necessary since strict @w{ISO C90} compilers
do not allow variable length array definitions, but still it is desirable
to avoid dynamic allocation.  This incomplete piece of code shows the
problem:

@smallexample
@{
  char buf[MB_LEN_MAX];
  ssize_t len = 0;

  while (! feof (fp))
    @{
      fread (&buf[len], 1, MB_CUR_MAX - len, fp);
      /* @r{@dots{} process} buf */
      len -= used;
    @}
@}
@end smallexample

The code in the inner loop is expected to have always enough bytes in
the array @var{buf} to convert one multibyte character.  The array
@var{buf} has to be sized statically since many compilers do not allow a
variable size.  The @code{fread} call makes sure that @code{MB_CUR_MAX}
bytes are always available in @var{buf}.  Note that it isn't
a problem if @code{MB_CUR_MAX} is not a compile-time constant.


@node Keeping the state
@subsection Representing the state of the conversion

@cindex stateful
In the introduction of this chapter it was said that certain character
sets use a @dfn{stateful} encoding.  That is, the encoded values depend
in some way on the previous bytes in the text.

Since the conversion functions allow converting a text in more than one
step we must have a way to pass this information from one call of the
functions to another.

@comment wchar.h
@comment ISO
@deftp {Data type} mbstate_t
@cindex shift state
A variable of type @code{mbstate_t} can contain all the information
about the @dfn{shift state} needed from one call to a conversion
function to another.

@pindex wchar.h
@code{mbstate_t} is defined in @file{wchar.h}.  It was introduced in
@w{Amendment 1} to @w{ISO C90}.
@end deftp

To use objects of type @code{mbstate_t} the programmer has to define such
objects (normally as local variables on the stack) and pass a pointer to
the object to the conversion functions.  This way the conversion function
can update the object if the current multibyte character set is stateful.

There is no specific function or initializer to put the state object in
any specific state.  The rules are that the object should always
represent the initial state before the first use, and this is achieved by
clearing the whole variable with code such as follows:

@smallexample
@{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* @r{from now on @var{state} can be used.}  */
  @dots{}
@}
@end smallexample

When using the conversion functions to generate output it is often
necessary to test whether the current state corresponds to the initial
state.  This is necessary, for example, to decide whether to emit
escape sequences to set the state to the initial state at certain
sequence points.  Communication protocols often require this.

@comment wchar.h
@comment ISO
@deftypefun int mbsinit (const mbstate_t *@var{ps})
The @code{mbsinit} function determines whether the state object pointed
to by @var{ps} is in the initial state.  If @var{ps} is a null pointer or
the object is in the initial state the return value is nonzero.  Otherwise
it is zero.

@pindex wchar.h
@code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
@end deftypefun

Code using @code{mbsinit} often looks similar to this:

@c Fix the example to explicitly say how to generate the escape sequence
@c to restore the initial state.
@smallexample
@{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* @r{Use @var{state}.}  */
  @dots{}
  if (! mbsinit (&state))
    @{
      /* @r{Emit code to return to initial state.}  */
      const wchar_t empty[] = L"";
      const wchar_t *srcp = empty;
      wcsrtombs (outbuf, &srcp, outbuflen, &state);
    @}
  @dots{}
@}
@end smallexample

The code to emit the escape sequence to get back to the initial state is
interesting.  The @code{wcsrtombs} function can be used to determine the
necessary output code (@pxref{Converting Strings}).  Please note that on
GNU systems it is not necessary to perform this extra action for the
conversion from multibyte text to wide character text since the wide
character encoding is not stateful.  But there is nothing mentioned in
any standard that prohibits making @code{wchar_t} use a stateful
encoding.

@node Converting a Character
@subsection Converting Single Characters

The most fundamental of the conversion functions are those dealing with
single characters.  Please note that this does not always mean single
bytes.  But since there is very often a subset of the multibyte
character set that consists of single byte sequences, there are
functions to help with converting bytes.  Frequently, ASCII is a subpart
of the multibyte character set.  In such a scenario, each ASCII character
stands for itself, and all other characters have at least a first byte
that is beyond the range @math{0} to @math{127}.

@comment wchar.h
@comment ISO
@deftypefun wint_t btowc (int @var{c})
The @code{btowc} function (``byte to wide character'') converts a valid
single byte character @var{c} in the initial shift state into the wide
character equivalent using the conversion rules from the currently
selected locale of the @code{LC_CTYPE} category.

If @code{(unsigned char) @var{c}} is not a valid single byte multibyte
character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.

Please note the restriction of @var{c} being tested for validity only in
the initial shift state.  No @code{mbstate_t} object is used from
which the state information is taken, and the function also does not use
any static state.

@pindex wchar.h
The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90}
and is declared in @file{wchar.h}.
@end deftypefun

Despite the limitation that the single byte value always is interpreted
in the initial state this function is actually useful most of the time.
Most character sets are either entirely single-byte character sets or
they are extensions of ASCII.  It is then possible to write code like this
(not that this specific example is very useful):

@smallexample
wchar_t *
itow (unsigned long int val)
@{
  static wchar_t buf[30];
  wchar_t *wcp = &buf[29];
  *wcp = L'\0';
  while (val != 0)
    @{
      *--wcp = btowc ('0' + val % 10);
      val /= 10;
    @}
  if (wcp == &buf[29])
    *--wcp = L'0';
  return wcp;
@}
@end smallexample

Why is it necessary to use such a complicated implementation and not
simply cast @code{'0' + val % 10} to a wide character?  The answer is
that there is no guarantee that one can perform this kind of arithmetic
on the characters of the character set used for @code{wchar_t}
representation.  In other situations the bytes are not constant at
compile time and so the compiler cannot do the work.  In situations like
this it is necessary to use @code{btowc}.

@noindent
There also is a function for the conversion in the other direction.

@comment wchar.h
@comment ISO
@deftypefun int wctob (wint_t @var{c})
The @code{wctob} function (``wide character to byte'') takes as the
parameter a valid wide character.  If the multibyte representation for
this character in the initial state is exactly one byte long, the return
value of this function is this character.  Otherwise the return value is
@code{EOF}.

@pindex wchar.h
@code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun

There are more general functions to convert single characters from
multibyte representation to wide characters and vice versa.  These
functions impose no limit on the length of the multibyte representation
and they also do not require it to be in the initial state.

@comment wchar.h
@comment ISO
@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
@cindex stateful
The @code{mbrtowc} function (``multibyte restartable to wide
character'') converts the next multibyte character in the string pointed
to by @var{s} into a wide character and stores it in the wide character
string pointed to by @var{pwc}.  The conversion is performed according
to the locale currently selected for the @code{LC_CTYPE} category.  If
the conversion for the character set used in the locale requires a state,
the multibyte string is interpreted in the state represented by the
object pointed to by @var{ps}.  If @var{ps} is a null pointer, a static,
internal state variable used only by the @code{mbrtowc} function is
used.

If the next multibyte character corresponds to the NUL wide character,
the return value of the function is @math{0} and the state object is
afterwards in the initial state.
If the next @var{n} or fewer bytes
form a correct multibyte character, the return value is the number of
bytes starting from @var{s} that form the multibyte character.  The
conversion state is updated according to the bytes consumed in the
conversion.  In both cases the wide character (either the @code{L'\0'}
or the one found in the conversion) is stored in the string pointed to
by @var{pwc} if @var{pwc} is not null.

If the first @var{n} bytes of the multibyte string possibly form a valid
multibyte character but there are more than @var{n} bytes needed to
complete it, the return value of the function is @code{(size_t) -2} and
no value is stored.  Please note that this can happen even if @var{n}
has a value greater than or equal to @code{MB_CUR_MAX} since the input
might contain redundant shift sequences.

If the first @var{n} bytes of the multibyte string cannot possibly form
a valid multibyte character, no value is stored, the global variable
@code{errno} is set to the value @code{EILSEQ}, and the function returns
@code{(size_t) -1}.  The conversion state is afterwards undefined.

@pindex wchar.h
@code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun

Use of @code{mbrtowc} is straightforward.
A function that copies a
multibyte string into a wide character string while at the same time
converting all lowercase characters into uppercase could look like this
(this is just an example; it does only minimal error checking):

@smallexample
wchar_t *
mbstouwcs (const char *s)
@{
  /* Include the terminating NUL byte in the conversion.  */
  size_t len = strlen (s) + 1;
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t tmp[1];
  mbstate_t state;
  size_t nbytes;

  if (result == NULL)
    return NULL;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
    @{
      if (nbytes >= (size_t) -2)
        @{
          /* Invalid or truncated input string.  */
          free (result);
          return NULL;
        @}
      *wcp++ = towupper (tmp[0]);
      len -= nbytes;
      s += nbytes;
    @}
  /* mbrtowc returned 0: the terminating NUL was reached.  */
  *wcp = L'\0';
  return result;
@}
@end smallexample

The use of @code{mbrtowc} should be clear.  A single wide character is
stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
in the variable @var{nbytes}.  If the conversion is successful, the
uppercase variant of the wide character is stored in the @var{result}
array and the pointer to the input string and the number of available
bytes are adjusted.

The only non-obvious thing about @code{mbrtowc} might be the way memory
is allocated for the result.  The above code uses the fact that there
can never be more wide characters in the converted results than there are
bytes in the multibyte input string.  This method yields a pessimistic
guess about the size of the result, and if many wide character strings
have to be constructed this way or if the strings are long, the extra
memory required to be allocated because the input string contains
multibyte characters might be significant.
The allocated memory block can 722: be resized to the correct size before returning it, but a better solution 723: might be to allocate just the right amount of space for the result right 724: away. Unfortunately there is no function to compute the length of the wide 725: character string directly from the multibyte string. There is, however, a 726: function that does part of the work. 727: 728: @comment wchar.h 729: @comment ISO 730: @deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps}) 731: The @code{mbrlen} function (``multibyte restartable length'') computes 732: the number of at most @var{n} bytes starting at @var{s}, which form the 733: next valid and complete multibyte character. 734: