JavaScript strings represent Unicode characters (internally encoded in UTF-16). Care must be taken when converting a Unicode string to or from a sequence of 8-bit bytes, such as a file on disk (File class) or a byte array in memory (ByteArray class).
The codecs supported by the scripting API can be classified as follows:
Codecs | Description | Data types |
---|---|---|
UTF-16, UTF-16BE, UTF-16LE | Map Unicode code points to a 16-bit (two-byte) representation | String <-> ByteArray / File |
UTF-8, latin1, Apple Roman, … | Map Unicode code points to an 8-bit (byte-oriented) representation; UTF-8 may use several bytes per code point | String <-> ByteArray / File |
Hex, Base64 | Map an 8-bit byte stream into a 7-bit ASCII representation | String or ByteArray may be used on both sides |
A codec (shorthand for "coder-decoder") specifies the text encoding used to represent a Unicode string as a sequence of 8-bit bytes. Various codecs are in use throughout the industry.
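As a rough illustration of why the choice of codec matters, the sketch below encodes the same characters with three encodings and prints the resulting bytes. It uses the Node.js `Buffer` global as a stand-in for the ByteArray class; `Buffer` is not part of the utility module and is assumed here only to make the example runnable.

```javascript
// Illustration only: Node.js Buffer stands in for the ByteArray class.
const text = "\u00e9\u20ac"; // U+00E9 (e with acute accent) and U+20AC (euro sign)

// UTF-8 uses a variable number of bytes per code point (2 and 3 here).
const utf8Bytes = Buffer.from(text, "utf8");

// UTF-16LE uses two bytes for each of these code points, low byte first.
const utf16Bytes = Buffer.from(text, "utf16le");

// latin1 represents U+00E9 in a single byte.
const latin1Bytes = Buffer.from("\u00e9", "latin1");

console.log(utf8Bytes.toString("hex"));   // c3a9e282ac
console.log(utf16Bytes.toString("hex"));  // e900ac20
console.log(latin1Bytes.toString("hex")); // e9
```

The same two characters occupy five, four, or (for the first one alone) one byte depending on the codec, which is why the reader and writer of a byte sequence must agree on it.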
The utility module refers to supported codecs through predefined names. To specify a codec in a function argument, use one of the following strings:
Apple Roman
ISO 8859-1 to 10
ISO 8859-13 to 16
latin1
UTF-8
UTF-8BOM (see byte order marker below)
UTF-16
UTF-16BE
UTF-16LE
Windows-1250 to 1258
In addition, the utility module supports the following special codecs (see Hex and Base64 below for more information):
Hex (corresponds to xsd:hexBinary)
Base64 (corresponds to xsd:base64Binary and SOAP-ENC:base64)
toHex, toBase64
fromHex, fromBase64
If the codec argument is missing or null, the codec is set to UTF-8 on Mac and to the current ANSI code page on Windows.
When converting from a byte sequence to a text string, byte sequences that do not match the requirements for the codec are replaced by the Unicode representation of a question mark ("?").
When converting from a text string to a byte sequence, Unicode code points that cannot be represented as a byte sequence using the codec are replaced by the ASCII representation of a question mark ("?").
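The question-mark substitution in both directions can be approximated as follows. `encodeLatin1` and `decodeUtf8` are hypothetical helpers written for this illustration, not functions of the utility module. The decoding half relies on the standard `TextDecoder`, which emits U+FFFD for invalid input, and maps that to "?"; how many question marks the utility module produces for a longer invalid sequence is not specified here, so treat the count as an assumption.

```javascript
// Hypothetical helpers that mimic the documented "?" substitution.
const QUESTION_MARK = 0x3f; // ASCII "?"

// String -> bytes: code points that latin1 cannot represent (above U+00FF) become "?".
function encodeLatin1(text) {
  const bytes = [];
  for (const ch of text) {            // iterates by code point
    const cp = ch.codePointAt(0);
    bytes.push(cp <= 0xff ? cp : QUESTION_MARK);
  }
  return Uint8Array.from(bytes);
}

// bytes -> String: invalid UTF-8 becomes "?".
// Caveat: a genuine U+FFFD in the input is also turned into "?" by this sketch.
function decodeUtf8(bytes) {
  return new TextDecoder("utf-8").decode(bytes).replace(/\uFFFD/g, "?");
}

console.log(encodeLatin1("caf\u00e9 \u20ac"));          // last byte is 0x3f
console.log(decodeUtf8(Uint8Array.of(0x41, 0xff, 0x42))); // "A?B"
```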
Regardless of the specified codec, the utility module automatically recognizes a UTF-16 BOM and, when one is found, adjusts the codec to the appropriate UTF-16 variant. For example, if the codec argument is set to "UTF-8", the utility module will correctly read UTF-8 data as well as UTF-16BE and UTF-16LE data that starts with a BOM.
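The BOM check described above amounts to inspecting the first two bytes. The function below is a minimal sketch under that assumption; `effectiveReadCodec` is a hypothetical name, and the utility module performs this step internally.

```javascript
// Hypothetical helper: returns the codec that would actually be used for reading.
function effectiveReadCodec(bytes, requestedCodec) {
  if (bytes.length >= 2) {
    // FE FF marks big-endian UTF-16, FF FE marks little-endian UTF-16.
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return "UTF-16BE";
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return "UTF-16LE";
  }
  // No UTF-16 BOM found: keep the codec the caller asked for.
  return requestedCodec;
}

console.log(effectiveReadCodec(Uint8Array.of(0xff, 0xfe, 0x41, 0x00), "UTF-8")); // UTF-16LE
console.log(effectiveReadCodec(Uint8Array.of(0x41, 0x42), "UTF-8"));             // UTF-8
```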
When the codec is set to one of the UTF-16 variants, the utility module always outputs an appropriate 2-byte UTF-16 BOM at the beginning of the data.
When the codec is set to UTF-8, the utility module does not output a BOM. This is the standard behavior expected by most applications.
When the codec is set to UTF-8BOM, the utility module outputs a 3-byte UTF-8 BOM at the beginning of the data (this is just a marker to indicate UTF-8 since there are no byte-order issues in this encoding). Be careful when using this codec, because many applications are not able to correctly interpret a UTF-8 BOM.
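The output rules for the BOM can be summarized in a small lookup. This is a sketch with a hypothetical `prependBom` helper, not the utility module's code. Plain "UTF-16" is left out on purpose, because the byte order it selects is not stated in this section.

```javascript
// BOM bytes per codec, following the rules described above.
const BOMS = new Map([
  ["UTF-16BE", [0xfe, 0xff]],       // 2-byte BOM, big-endian
  ["UTF-16LE", [0xff, 0xfe]],       // 2-byte BOM, little-endian
  ["UTF-8BOM", [0xef, 0xbb, 0xbf]], // 3-byte UTF-8 marker
  ["UTF-8", []],                    // no BOM
]);

// Hypothetical helper: returns a new byte array with the codec's BOM in front of the payload.
function prependBom(codec, payload) {
  const bom = BOMS.get(codec);
  if (bom === undefined) {
    throw new Error("codec not covered by this sketch: " + codec);
  }
  const out = new Uint8Array(bom.length + payload.length);
  out.set(bom, 0);
  out.set(payload, bom.length);
  return out;
}

console.log(prependBom("UTF-8BOM", Uint8Array.of(0x41))); // ef bb bf 41
console.log(prependBom("UTF-8", Uint8Array.of(0x41)));    // 41
```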
Converting between a String and a ByteArray with any of the codecs (except Hex and Base64) is always unambiguous: the String contains the Unicode code points, the ByteArray contains the 16-bit or 8-bit representation.
With a Hex or Base64 codec, the situation is problematic because:
Both decoded and encoded sides can be represented in 8 bits
Following the logic that "the encoded side is in the ByteArray" (as with the other codecs) contradicts the frequent use case where "the encoded side is in a String" (for example, part of an XML stream).
To resolve this issue, the 10 release introduces the following directional codecs:
Codec | Description |
---|---|
toHex, toBase64 | Map an 8-bit byte stream into the specified 7-bit ASCII representation, regardless of the source and target data types. If the source is a String, only the 8 lowest bits of each Unicode code point are used (as if latin1 encoding was used) |
fromHex, fromBase64 | Map a 7-bit ASCII representation of the specified type into an 8-bit byte stream, regardless of the source and target data types. If the target is a String, only the 8 lowest bits of each Unicode code point are set and higher bits are cleared (as if latin1 encoding was used) |
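The directional behavior can be sketched in plain JavaScript as shown below. The functions `toHex`, `fromHex`, and `toBase64` here are hypothetical stand-ins named after the codecs, not the utility module's API; `Uint8Array` and the Node.js `Buffer` global take the place of ByteArray. Emitting uppercase hex digits is an arbitrary choice made for this sketch.

```javascript
// Reduce a String or a byte array to plain 8-bit values.
// For a String, only the 8 lowest bits of each code point are kept.
function toByteValues(source) {
  if (typeof source === "string") {
    return Array.from(source, (ch) => ch.codePointAt(0) & 0xff);
  }
  return Array.from(source);
}

// toHex: 8-bit values -> hex digits, whatever the source type.
function toHex(source) {
  return toByteValues(source)
    .map((b) => b.toString(16).padStart(2, "0").toUpperCase())
    .join("");
}

// toBase64: same idea, using Buffer for the Base64 alphabet.
function toBase64(source) {
  return Buffer.from(toByteValues(source)).toString("base64");
}

// fromHex: hex digits (as a String or as ASCII bytes) -> 8-bit values.
function fromHex(encoded) {
  const digits =
    typeof encoded === "string" ? encoded : String.fromCharCode(...encoded);
  if (!/^([0-9a-fA-F]{2})*$/.test(digits)) {
    throw new Error("not a valid hex string");
  }
  const out = new Uint8Array(digits.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(digits.slice(2 * i, 2 * i + 2), 16);
  }
  return out;
}

console.log(toHex("AB"));                   // 4142
console.log(toHex(Uint8Array.of(0, 255)));  // 00FF
console.log(toHex("\u20ac"));               // AC (high bits of U+20AC are dropped)
console.log(toBase64("Hi"));                // SGk=
console.log(fromHex("4142"));               // bytes 0x41, 0x42
```

Note how a String source containing U+20AC loses its upper bits, which matches the "as if latin1 encoding was used" remark in the table.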
For compatibility reasons, the non-directional codecs Hex and Base64 can still be used; their direction is derived from the target data type:
Hex/Base64 will operate as toHex/toBase64 when the target is a ByteArray
Hex/Base64 will operate as fromHex/fromBase64 when the target is a String.
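The compatibility rule is a simple dispatch on the target type. The helper below is a hypothetical sketch of that rule; the type names are passed as plain strings for illustration only.

```javascript
// Hypothetical helper: maps a legacy codec name to its directional equivalent.
function resolveLegacyCodec(codec, targetType) {
  if (codec !== "Hex" && codec !== "Base64") {
    return codec; // every other codec name is left untouched
  }
  if (targetType === "ByteArray") return "to" + codec;   // encoded side ends up in the ByteArray
  if (targetType === "String") return "from" + codec;    // decoded side ends up in the String
  throw new Error("unexpected target type: " + targetType);
}

console.log(resolveLegacyCodec("Hex", "ByteArray")); // toHex
console.log(resolveLegacyCodec("Base64", "String")); // fromBase64
```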