
/* Coding system handler (conversion, detection, etc.).
   Copyright (C) 2001, 2002, 2003, 2004, 2005,
                 2006, 2007 Free Software Foundation, Inc.
   Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
     2005, 2006, 2007
     National Institute of Advanced Industrial Science and Technology (AIST)
     Registration Number H14PRO021

This file is part of GNU Emacs.

GNU Emacs is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.

GNU Emacs is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with GNU Emacs; see the file COPYING. If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
Boston, MA 02110-1301, USA. */

/*** TABLE OF CONTENTS ***

  0. General comments
  1. Preamble
  2. Emacs' internal format (emacs-mule) handlers
  3. ISO2022 handlers
  4. Shift-JIS and BIG5 handlers
  5. CCL handlers
  6. End-of-line handlers
  7. C library functions
  8. Emacs Lisp library functions
  9. Post-amble

*/

/*** 0. General comments ***/


/*** GENERAL NOTE on CODING SYSTEMS ***

  A coding system is an encoding mechanism for one or more character
  sets. Here's a list of coding systems which Emacs can handle. When
  we say "decode", it means converting some other coding system to
  Emacs' internal format (emacs-mule), and when we say "encode",
  it means converting the coding system emacs-mule to some other
  coding system.

  0. Emacs' internal format (emacs-mule)

  Emacs itself holds a multi-lingual character in buffers and strings
  in a special format. Details are described in section 2.

  1. ISO2022

  The most famous coding system for multiple character sets. X's
  Compound Text, various EUCs (Extended Unix Code), and coding
  systems used in Internet communication such as ISO-2022-JP are
  all variants of ISO2022. Details are described in section 3.

  2. SJIS (or Shift-JIS or MS-Kanji-Code)

  A coding system to encode character sets: ASCII, JISX0201, and
  JISX0208. Widely used for PC's in Japan. Details are described in
  section 4.

  3. BIG5

  A coding system to encode the character sets ASCII and Big5. Widely
  used for Chinese (mainly in Taiwan and Hong Kong). Details are
  described in section 4. In this file, when we write "BIG5"
  (all uppercase), we mean the coding system, and when we write
  "Big5" (capitalized), we mean the character set.

  4. Raw text

  A coding system for text containing random 8-bit code. Emacs does
  no code conversion on such text except for end-of-line format.

  5. Other

  If a user wants to read/write text encoded in a coding system not
  listed above, he can supply a decoder and an encoder for it as CCL
  (Code Conversion Language) programs. Emacs executes the CCL program
  while reading/writing.

  Emacs represents a coding system by a Lisp symbol that has a property
  `coding-system'. But, before actually using the coding system, the
  information about it is set in a structure of type `struct
  coding_system' for rapid processing. See section 6 for more details.

*/

/*** GENERAL NOTES on END-OF-LINE FORMAT ***

  How end-of-line of text is encoded depends on the operating system.
  For instance, Unix's format is just one byte of `line-feed' code,
  whereas DOS's format is a two-byte sequence of `carriage-return' and
  `line-feed' codes. MacOS's format is usually one byte of
  `carriage-return'.

  Since text character encoding and end-of-line encoding are
  independent, any coding system described above can have any
  end-of-line format. So Emacs has information about end-of-line
  format in each coding-system. See section 6 for more details.

*/

/*** GENERAL NOTES on `detect_coding_XXX ()' functions ***

  These functions check if a text between SRC and SRC_END is encoded
  in the coding system category XXX. Each returns an integer value in
  which appropriate flag bits for the category XXX are set. The flag
  bits are defined in macros CODING_CATEGORY_MASK_XXX. Below is the
  template for these functions. If MULTIBYTEP is nonzero, 8-bit codes
  of the range 0x80..0x9F are in multibyte form. */
#if 0
int
detect_coding_emacs_mule (src, src_end, multibytep)
     unsigned char *src, *src_end;
     int multibytep;
{
  ...
}
#endif

/*** GENERAL NOTES on `decode_coding_XXX ()' functions ***

  These functions decode SRC_BYTES length of unibyte text at SOURCE
  encoded in CODING to Emacs' internal format. The resulting
  multibyte text goes to a place pointed to by DESTINATION, the length
  of which should not exceed DST_BYTES.

  These functions set the information about original and decoded texts
  in the members `produced', `produced_char', `consumed', and
  `consumed_char' of the structure *CODING. They also set the member
  `result' to one of CODING_FINISH_XXX indicating how the decoding
  finished.

  DST_BYTES zero means that the source area and destination area are
  overlapped, which means that we can produce a decoded text until it
  reaches the head of the not-yet-decoded source text.

  Below is a template for these functions. */
#if 0
static void
decode_coding_XXX (coding, source, destination, src_bytes, dst_bytes)
     struct coding_system *coding;
     const unsigned char *source;
     unsigned char *destination;
     int src_bytes, dst_bytes;
{
  ...
}
#endif

/*** GENERAL NOTES on `encode_coding_XXX ()' functions ***

  These functions encode SRC_BYTES length text at SOURCE from Emacs'
  internal multibyte format to CODING. The resulting unibyte text
  goes to a place pointed to by DESTINATION, the length of which
  should not exceed DST_BYTES.

  These functions set the information about original and encoded texts
  in the members `produced', `produced_char', `consumed', and
  `consumed_char' of the structure *CODING. They also set the member
  `result' to one of CODING_FINISH_XXX indicating how the encoding
  finished.

  DST_BYTES zero means that the source area and destination area are
  overlapped, which means that we can produce encoded text until it
  reaches the head of the not-yet-encoded source text.

  Below is a template for these functions. */
#if 0
static void
encode_coding_XXX (coding, source, destination, src_bytes, dst_bytes)
     struct coding_system *coding;
     unsigned char *source, *destination;
     int src_bytes, dst_bytes;
{
  ...
}
#endif

/*** COMMONLY USED MACROS ***/

/* The following two macros ONE_MORE_BYTE and TWO_MORE_BYTES safely
   get one and two bytes from the source text respectively.
   If there are not enough bytes in the source, they jump to
   `label_end_of_loop'.
   The caller should set variables `coding',
   `src' and `src_end' to appropriate pointers in advance. These
   macros are called from decoding routines `decode_coding_XXX', thus
   it is assumed that the source text is unibyte. */

#define ONE_MORE_BYTE(c1) \
  do { \
    if (src >= src_end) \
      { \
        coding->result = CODING_FINISH_INSUFFICIENT_SRC; \
        goto label_end_of_loop; \
      } \
    c1 = *src++; \
  } while (0)

#define TWO_MORE_BYTES(c1, c2) \
  do { \
    if (src + 1 >= src_end) \
      { \
        coding->result = CODING_FINISH_INSUFFICIENT_SRC; \
        goto label_end_of_loop; \
      } \
    c1 = *src++; \
    c2 = *src++; \
  } while (0)


/* Like ONE_MORE_BYTE, but 8-bit bytes of data at SRC are in multibyte
   form if MULTIBYTEP is nonzero. In addition, if SRC is not less
   than SRC_END, return with RET. */

#define ONE_MORE_BYTE_CHECK_MULTIBYTE(c1, multibytep, ret) \
  do { \
    if (src >= src_end) \
      { \
        coding->result = CODING_FINISH_INSUFFICIENT_SRC; \
        return ret; \
      } \
    c1 = *src++; \
    if (multibytep && c1 == LEADING_CODE_8_BIT_CONTROL) \
      c1 = *src++ - 0x20; \
  } while (0)

/* Set C to the next character at the source text pointed by `src'.
   If there are not enough characters in the source, jump to
   `label_end_of_loop'. The caller should set variables `coding'
   `src', `src_end', and `translation_table' to appropriate pointers
   in advance. This macro is used in encoding routines
   `encode_coding_XXX', thus it assumes that the source text is in
   multibyte form except for 8-bit characters. 8-bit characters are
   in multibyte form if coding->src_multibyte is nonzero, else they
   are represented by a single byte.
   */

#define ONE_MORE_CHAR(c) \
  do { \
    int len = src_end - src; \
    int bytes; \
    if (len <= 0) \
      { \
        coding->result = CODING_FINISH_INSUFFICIENT_SRC; \
        goto label_end_of_loop; \
      } \
    if (coding->src_multibyte \
        || UNIBYTE_STR_AS_MULTIBYTE_P (src, len, bytes)) \
      c = STRING_CHAR_AND_LENGTH (src, len, bytes); \
    else \
      c = *src, bytes = 1; \
    if (!NILP (translation_table)) \
      c = translate_char (translation_table, c, -1, 0, 0); \
    src += bytes; \
  } while (0)


/* Produce a multibyte form of character C to `dst'. Jump to
   `label_end_of_loop' if there's not enough space at `dst'.

   If we are now in the middle of a composition sequence, the decoded
   character may be ALTCHAR (for the current composition). In that
   case, the character goes to coding->cmp_data->data instead of
   `dst'.

   This macro is used in decoding routines. */

#define EMIT_CHAR(c) \
  do { \
    if (! COMPOSING_P (coding) \
        || coding->composing == COMPOSITION_RELATIVE \
        || coding->composing == COMPOSITION_WITH_RULE) \
      { \
        int bytes = CHAR_BYTES (c); \
        if ((dst + bytes) > (dst_bytes ? dst_end : src)) \
          { \
            coding->result = CODING_FINISH_INSUFFICIENT_DST; \
            goto label_end_of_loop; \
          } \
        dst += CHAR_STRING (c, dst); \
        coding->produced_char++; \
      } \
    \
    if (COMPOSING_P (coding) \
        && coding->composing != COMPOSITION_RELATIVE) \
      { \
        CODING_ADD_COMPOSITION_COMPONENT (coding, c); \
        coding->composition_rule_follows \
          = coding->composing != COMPOSITION_WITH_ALTCHARS; \
      } \
  } while (0)


#define EMIT_ONE_BYTE(c) \
  do { \
    if (dst >= (dst_bytes ? dst_end : src)) \
      { \
        coding->result = CODING_FINISH_INSUFFICIENT_DST; \
        goto label_end_of_loop; \
      } \
    *dst++ = c; \
  } while (0)

#define EMIT_TWO_BYTES(c1, c2) \
  do { \
    if (dst + 2 > (dst_bytes ? dst_end : src)) \
      { \
        coding->result = CODING_FINISH_INSUFFICIENT_DST; \
        goto label_end_of_loop; \
      } \
    *dst++ = c1, *dst++ = c2; \
  } while (0)

#define EMIT_BYTES(from, to) \
  do { \
    if (dst + (to - from) > (dst_bytes ? dst_end : src)) \
      { \
        coding->result = CODING_FINISH_INSUFFICIENT_DST; \
        goto label_end_of_loop; \
      } \
    while (from < to) \
      *dst++ = *from++; \
  } while (0)


/*** 1. Preamble ***/

#ifdef emacs
#include <config.h>
#endif

#include <stdio.h>

#ifdef emacs

#include "lisp.h"
#include "buffer.h"
#include "charset.h"
#include "composite.h"
#include "ccl.h"
#include "coding.h"
#include "window.h"
#include "intervals.h"

#else  /* not emacs */

#include "mulelib.h"

#endif /* not emacs */

Lisp_Object Qcoding_system, Qeol_type;
Lisp_Object Qbuffer_file_coding_system;
Lisp_Object Qpost_read_conversion, Qpre_write_conversion;
Lisp_Object Qno_conversion, Qundecided;
Lisp_Object Qcoding_system_history;
Lisp_Object Qsafe_chars;
Lisp_Object Qvalid_codes;
Lisp_Object Qascii_incompatible;

extern Lisp_Object Qinsert_file_contents, Qwrite_region;
Lisp_Object Qcall_process, Qcall_process_region;
Lisp_Object Qstart_process, Qopen_network_stream;
Lisp_Object Qtarget_idx;

/* If a symbol has this property, evaluate the value to define the
   symbol as a coding system.
   */
Lisp_Object Qcoding_system_define_form;

Lisp_Object Vselect_safe_coding_system_function;

int coding_system_require_warning;

/* Mnemonic string for each format of end-of-line. */
Lisp_Object eol_mnemonic_unix, eol_mnemonic_dos, eol_mnemonic_mac;
/* Mnemonic string to indicate format of end-of-line is not yet
   decided. */
Lisp_Object eol_mnemonic_undecided;

/* Format of end-of-line decided by system. This is CODING_EOL_LF on
   Unix, CODING_EOL_CRLF on DOS/Windows, and CODING_EOL_CR on Mac.
   This has an effect only for external encoding (i.e. for output to
   file and process), not for in-buffer or Lisp string encoding. */
int system_eol_type;

#ifdef emacs

/* Information about which coding system is safe for which chars.
   The value has the form (GENERIC-LIST . NON-GENERIC-ALIST).

   GENERIC-LIST is a list of generic coding systems which can encode
   any characters.

   NON-GENERIC-ALIST is an alist of non generic coding systems vs the
   corresponding char table that contains safe chars. */
Lisp_Object Vcoding_system_safe_chars;

Lisp_Object Vcoding_system_list, Vcoding_system_alist;

Lisp_Object Qcoding_system_p, Qcoding_system_error;

/* Coding system emacs-mule and raw-text are for converting only
   end-of-line format. */
Lisp_Object Qemacs_mule, Qraw_text;

Lisp_Object Qutf_8;

/* Coding-systems are handed between Emacs Lisp programs and C internal
   routines by the following three variables. */
/* Coding-system for reading files and receiving data from process. */
Lisp_Object Vcoding_system_for_read;
/* Coding-system for writing files and sending data to process. */
Lisp_Object Vcoding_system_for_write;
/* Coding-system actually used in the latest I/O.
   */
Lisp_Object Vlast_coding_system_used;

/* A vector of length 256 which contains information about special
   Latin codes (especially for dealing with Microsoft codes). */
Lisp_Object Vlatin_extra_code_table;

/* Flag to inhibit code conversion of end-of-line format. */
int inhibit_eol_conversion;

/* Flag to inhibit ISO2022 escape sequence detection. */
int inhibit_iso_escape_detection;

/* Flag to make buffer-file-coding-system inherit from process-coding. */
int inherit_process_coding_system;

/* Coding system to be used to encode text for terminal display. */
struct coding_system terminal_coding;

/* Coding system to be used to encode text for terminal display when
   terminal coding system is nil. */
struct coding_system safe_terminal_coding;

/* Coding system of what is sent from terminal keyboard. */
struct coding_system keyboard_coding;

/* Default coding system to be used to write a file. */
struct coding_system default_buffer_file_coding;

Lisp_Object Vfile_coding_system_alist;
Lisp_Object Vprocess_coding_system_alist;
Lisp_Object Vnetwork_coding_system_alist;

Lisp_Object Vlocale_coding_system;

#endif /* emacs */

Lisp_Object Qcoding_category, Qcoding_category_index;

/* List of symbols `coding-category-xxx' ordered by priority. */
Lisp_Object Vcoding_category_list;

/* Table of coding categories (Lisp symbols). */
Lisp_Object Vcoding_category_table;

/* Table of names of symbol for each coding-category.
   */
char *coding_category_name[CODING_CATEGORY_IDX_MAX] = {
  "coding-category-emacs-mule",
  "coding-category-sjis",
  "coding-category-iso-7",
  "coding-category-iso-7-tight",
  "coding-category-iso-8-1",
  "coding-category-iso-8-2",
  "coding-category-iso-7-else",
  "coding-category-iso-8-else",
  "coding-category-ccl",
  "coding-category-big5",
  "coding-category-utf-8",
  "coding-category-utf-16-be",
  "coding-category-utf-16-le",
  "coding-category-raw-text",
  "coding-category-binary"
};

/* Table of pointers to coding systems corresponding to each coding
   category. */
struct coding_system *coding_system_table[CODING_CATEGORY_IDX_MAX];

/* Table of coding category masks. Nth element is a mask for a coding
   category whose priority is Nth. */
static
int coding_priorities[CODING_CATEGORY_IDX_MAX];

/* Flag to tell if we look up translation table on character code
   conversion. */
Lisp_Object Venable_character_translation;
/* Standard translation table to look up on decoding (reading). */
Lisp_Object Vstandard_translation_table_for_decode;
/* Standard translation table to look up on encoding (writing). */
Lisp_Object Vstandard_translation_table_for_encode;

Lisp_Object Qtranslation_table;
Lisp_Object Qtranslation_table_id;
Lisp_Object Qtranslation_table_for_decode;
Lisp_Object Qtranslation_table_for_encode;

/* Alist of charsets vs revision number. */
Lisp_Object Vcharset_revision_alist;

/* Default coding systems used for process I/O. */
Lisp_Object Vdefault_process_coding_system;

/* Char table for translating Quail and self-inserting input. */
Lisp_Object Vtranslation_table_for_input;

/* Global flag to tell that we can't call post-read-conversion and
   pre-write-conversion functions.
   Usually the value is zero, but it
   is set to 1 temporarily while such functions are running. This is
   to avoid infinite recursion. */
static int inhibit_pre_post_conversion;

Lisp_Object Qchar_coding_system;

/* Return `safe-chars' property of CODING_SYSTEM (symbol). Don't check
   its validity. */

Lisp_Object
coding_safe_chars (coding_system)
     Lisp_Object coding_system;
{
  Lisp_Object coding_spec, plist, safe_chars;

  coding_spec = Fget (coding_system, Qcoding_system);
  /* Element 3 of the coding spec vector holds the property list. */
  plist = XVECTOR (coding_spec)->contents[3];
  safe_chars = Fplist_get (plist, Qsafe_chars);
  return (CHAR_TABLE_P (safe_chars) ? safe_chars : Qt);
}

#define CODING_SAFE_CHAR_P(safe_chars, c) \
  (EQ (safe_chars, Qt) || !NILP (CHAR_TABLE_REF (safe_chars, c)))


/*** 2. Emacs' internal format (emacs-mule) handlers ***/

/* Emacs' internal format for representation of multiple character
   sets is a kind of multi-byte encoding, i.e. characters are
   represented by variable-length sequences of one-byte codes.

   ASCII characters and control characters (e.g. `tab', `newline') are
   represented by one-byte sequences which are their ASCII codes, in
   the range 0x00 through 0x7F.

   8-bit characters of the range 0x80..0x9F are represented by
   two-byte sequences of LEADING_CODE_8_BIT_CONTROL and (their 8-bit
   code + 0x20).

   8-bit characters of the range 0xA0..0xFF are represented by
   one-byte sequences which are their 8-bit code.

   The other characters are represented by a sequence of `base
   leading-code', optional `extended leading-code', and one or two
   `position-code's. The length of the sequence is determined by the
   base leading-code. Leading-code takes the range 0x81 through 0x9D,
   whereas extended leading-code and position-code take the range 0xA0
   through 0xFF.
   See `charset.h' for more details about leading-code
   and position-code.

   --- CODE RANGE of Emacs' internal format ---
   character set        range
   -------------        -----
   ascii                0x00..0x7F
   eight-bit-control    LEADING_CODE_8_BIT_CONTROL + 0xA0..0xBF
   eight-bit-graphic    0xA0..0xFF
   ELSE                 0x81..0x9D + [0xA0..0xFF]+
   ---------------------------------------------

   As this is the internal character representation, the format is
   usually not used externally (i.e. in a file or in data sent to a
   process). But, it is possible to have a text externally in this
   format (i.e. by encoding by the coding system `emacs-mule').

   In that case, a sequence of one-byte codes has a slightly different
   form.

   Firstly, all characters in eight-bit-control are represented by
   one-byte sequences which are their 8-bit code.

   Next, character composition data are represented by the byte
   sequence of the form: 0x80 METHOD BYTES CHARS COMPONENT ...,
   where,
        METHOD is 0xF0 plus one of the composition methods (enum
        composition_method),

        BYTES is 0xA0 plus the byte length of these composition data,

        CHARS is 0xA0 plus the number of characters composed by these
        data,

        COMPONENTs are characters in multibyte form or composition
        rules encoded as two bytes of ASCII codes.

   In addition, for backward compatibility, the following formats are
   also recognized as composition data on decoding.

   0x80 MSEQ ...
   0x80 0xFF MSEQ RULE MSEQ RULE ... MSEQ

   Here,
        MSEQ is a multibyte form but in this special format:
          ASCII: 0xA0 ASCII_CODE+0x80,
          other: LEADING_CODE+0x20 FOLLOWING-BYTE ...,
        RULE is a one byte code of the range 0xA0..0xF0 that
        represents a composition rule.
  */

enum emacs_code_class_type emacs_code_class[256];

/* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
   Check if a text is encoded in Emacs' internal format. If it is,