Text encoding

JavaScript strings represent Unicode characters (internally encoded in UTF-16). Care must be taken when converting a Unicode string to or from a sequence of 8-bit bytes, such as a file on disk (File class) or a byte array in memory (ByteArray class).

The codecs supported by the scripting API can be classified as follows:

Codecs	Description	Data types
UTF-16, UTF-16BE, UTF-16LE	Map Unicode code points to a representation built from 16-bit (two-byte) units	String <-> ByteArray / File
UTF-8, latin1, Apple Roman, …	Map Unicode code points to a representation built from 8-bit (single-byte) units	String <-> ByteArray / File
Hex, Base64	Map an 8-bit byte stream into a 7-bit ASCII representation	String or ByteArray may be used on both sides
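
The three families in the table can be illustrated with a short sketch. It runs under Node.js and uses Node's Buffer as a stand-in for the ByteArray class; this is an assumption made for illustration and is not the scripting API itself.

```javascript
// Node.js Buffer stands in for ByteArray here (an assumption, not the scripting API).
const text = "\u00e9\u20ac"; // U+00E9 (e with acute accent) followed by U+20AC (euro sign)

// 16-bit family: each UTF-16 code unit becomes two bytes (little-endian here).
const utf16le = Buffer.from(text, "utf16le");

// 8-bit family: UTF-8 uses one to four bytes per code point,
// latin1 uses exactly one byte for code points up to U+00FF.
const utf8 = Buffer.from(text, "utf8");
const latin1 = Buffer.from("\u00e9", "latin1");

// Hex and Base64: arbitrary 8-bit bytes rendered as 7-bit ASCII text.
const raw = Buffer.from([0xff, 0x00]);

console.log(utf16le.toString("hex")); // e900ac20
console.log(utf8.toString("hex"));    // c3a9e282ac
console.log(latin1.toString("hex"));  // e9
console.log(raw.toString("hex"));     // ff00
console.log(raw.toString("base64"));  // /wA=
```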

Codec

A codec (short for "coder-decoder") specifies the text encoding used to represent a Unicode string as a sequence of 8-bit bytes. Various codecs are in use throughout the industry.

The utility module refers to supported codecs through predefined names. To specify a codec in a function argument, use one of the following strings:

Apple Roman
ISO 8859-1 to 10
ISO 8859-13 to 16
latin1
UTF-8
UTF-8BOM (see byte order marker below)
UTF-16
UTF-16BE
UTF-16LE
Windows-1250 to 1258

In addition, the utility module supports the following special codecs (see Hex and Base64 below for more information):

Hex (corresponds to xsd:hexBinary)
Base64 (corresponds to xsd:base64Binary and SOAP-ENC:base64)
toHex, toBase64
fromHex, fromBase64

Note:

If the codec argument is missing or null, the codec is set to UTF-8 on Mac and to the current ANSI code page on Windows.

Conversion errors

When converting from a byte sequence to a text string, byte sequences that do not match the requirements for the codec are replaced by the Unicode representation of a question mark ("?").

When converting from a text string to a byte sequence, Unicode code points that cannot be represented as a byte sequence using the codec are replaced by the ASCII representation of a question mark ("?").
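
The replacement rule in both directions can be sketched as follows. The helper names encodeLatin1 and decodeUtf8 are hypothetical, Uint8Array stands in for ByteArray, and the exact number of question marks the utility module emits for a given invalid sequence is not specified above, so treat this as an approximation of the behavior rather than a reproduction of it.

```javascript
// Hypothetical helpers for illustration; Uint8Array stands in for ByteArray.
const QUESTION_MARK = 0x3f; // ASCII "?"

// Text -> bytes with latin1: code points above U+00FF have no latin1 byte,
// so each one is replaced by an ASCII question mark.
function encodeLatin1(text) {
  const bytes = [];
  for (const ch of text) { // for...of iterates by code point
    const codePoint = ch.codePointAt(0);
    bytes.push(codePoint <= 0xff ? codePoint : QUESTION_MARK);
  }
  return Uint8Array.from(bytes);
}

// Bytes -> text with UTF-8: TextDecoder marks invalid sequences with U+FFFD,
// which this sketch then turns into "?". Note that a genuine U+FFFD in the
// input is also turned into "?" by this simplification.
function decodeUtf8(bytes) {
  return new TextDecoder("utf-8").decode(bytes).replace(/\uFFFD/g, "?");
}

console.log(Array.from(encodeLatin1("a\u20acb")));             // [ 97, 63, 98 ]
console.log(decodeUtf8(Uint8Array.from([0x41, 0xff, 0x42])));  // A?B
```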

Byte order marker (BOM): Input (from a byte sequence to a text string)

Regardless of the specified codec, the utility module automatically recognizes a UTF-16 BOM and, when one is found, switches to the appropriate UTF-16 variant. For example, if the codec argument is set to "UTF-8", the utility module correctly reads UTF-8, UTF-16BE and UTF-16LE data.
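
The detection step described above might look like the sketch below. The function name resolveInputCodec and the shape of its return value are hypothetical; only the two BOM byte patterns (FE FF for big-endian, FF FE for little-endian) are standard UTF-16 facts.

```javascript
// Hypothetical helper: picks the codec to use for reading, given the first
// bytes of the data and the codec the caller asked for.
function resolveInputCodec(bytes, requestedCodec) {
  if (bytes.length >= 2) {
    // FE FF is the UTF-16 big-endian byte order marker.
    if (bytes[0] === 0xfe && bytes[1] === 0xff) {
      return { codec: "UTF-16BE", bomLength: 2 };
    }
    // FF FE is the UTF-16 little-endian byte order marker.
    if (bytes[0] === 0xff && bytes[1] === 0xfe) {
      return { codec: "UTF-16LE", bomLength: 2 };
    }
  }
  // No UTF-16 BOM found: keep the codec the caller specified.
  return { codec: requestedCodec, bomLength: 0 };
}

console.log(resolveInputCodec([0xff, 0xfe, 0x41, 0x00], "UTF-8").codec); // UTF-16LE
console.log(resolveInputCodec([0xfe, 0xff, 0x00, 0x41], "UTF-8").codec); // UTF-16BE
console.log(resolveInputCodec([0x41, 0x42], "UTF-8").codec);             // UTF-8
```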

Byte order marker (BOM): Output (from a text string to a byte sequence)

When the codec is set to one of the UTF-16 variants, the utility module always outputs an appropriate 2-byte UTF-16 BOM at the beginning of the data.

When the codec is set to UTF-8, the utility module does not output a BOM. This is the standard behavior expected by most applications.

When the codec is set to UTF-8BOM, the utility module outputs a 3-byte UTF-8 BOM at the beginning of the data (this is just a marker to indicate UTF-8 since there are no byte-order issues in this encoding). Be careful when using this codec, because many applications are not able to correctly interpret a UTF-8 BOM.
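
The output rules can be summarized in a small sketch. It uses Node's Buffer in place of ByteArray, and encodeWithBom is a hypothetical name. The plain "UTF-16" codec is left out on purpose, because the text above does not say which byte order it selects.

```javascript
// BOM bytes per codec name; an empty list means no BOM is written.
const BOMS = {
  "UTF-16BE": [0xfe, 0xff],
  "UTF-16LE": [0xff, 0xfe],
  "UTF-8BOM": [0xef, 0xbb, 0xbf],
  "UTF-8": [],
};

// Hypothetical helper: encodes text and prepends the BOM that matches the codec.
function encodeWithBom(text, codec) {
  const bom = BOMS[codec];
  if (bom === undefined) {
    throw new Error("codec not covered by this sketch: " + codec);
  }
  let body;
  if (codec === "UTF-8" || codec === "UTF-8BOM") {
    body = Buffer.from(text, "utf8");
  } else {
    body = Buffer.from(text, "utf16le");
    if (codec === "UTF-16BE") {
      body.swap16(); // swap each byte pair in place to get big-endian order
    }
  }
  return Buffer.concat([Buffer.from(bom), body]);
}

console.log(encodeWithBom("A", "UTF-16BE").toString("hex")); // feff0041
console.log(encodeWithBom("A", "UTF-16LE").toString("hex")); // fffe4100
console.log(encodeWithBom("A", "UTF-8").toString("hex"));    // 41
console.log(encodeWithBom("A", "UTF-8BOM").toString("hex")); // efbbbf41
```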

Hex and Base64

Converting between a String and a ByteArray with any of the codecs (except Hex and Base64) is always unambiguous: the String contains the Unicode code points, the ByteArray contains the 16-bit or 8-bit representation.

With a Hex or Base64 codec, the situation is problematic because:

Both decoded and encoded sides can be represented in 8 bits
Following the logic that "the encoded side is in the ByteArray" (as with the other codecs) contradicts the frequent use case where "the encoded side is in a String" (for example, part of an XML stream).

To resolve this issue, the 10 release introduces the following directional codecs:

Codec	Description
toHex, toBase64	Map an 8-bit byte stream into the specified 7-bit ASCII representation, regardless of the source and target data types. If the source is a String, only the 8 lowest bits of each Unicode code point are used (as if latin1 encoding was used).
fromHex, fromBase64	Map a 7-bit ASCII representation of the specified type into an 8-bit byte stream, regardless of the source and target data types. If the source is a String, only the 8 lowest bits of each Unicode code point are set and higher bits are cleared (as if latin1 encoding was used).
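
As a rough illustration of the directional codecs, the sketch below mimics them with Node's Buffer. The function names reuse the codec names for readability, but these are hypothetical stand-ins and not the scripting API. Node's latin1 handling works on UTF-16 code units rather than code points, and the letter case of the hex output produced by the utility module is not stated above.

```javascript
// Normalizes the source to bytes. A string is reduced to the low 8 bits of
// each UTF-16 code unit, which is what Node's "latin1" encoding does.
function toBytes(source) {
  return typeof source === "string"
    ? Buffer.from(source, "latin1")
    : Buffer.from(source);
}

// Encoding direction: 8-bit bytes -> 7-bit ASCII text.
function toHex(source) {
  return toBytes(source).toString("hex");
}
function toBase64(source) {
  return toBytes(source).toString("base64");
}

// Decoding direction: 7-bit ASCII text -> 8-bit bytes.
function fromHex(source) {
  return Buffer.from(toBytes(source).toString("latin1"), "hex");
}
function fromBase64(source) {
  return Buffer.from(toBytes(source).toString("latin1"), "base64");
}

console.log(toHex("Hi"));                                // 4869
console.log(toBase64(Buffer.from([0x48, 0x69])));        // SGk=
console.log(fromHex("4869").toString("latin1"));         // Hi
console.log(fromBase64("SGk=").toString("latin1"));      // Hi
```

The direction is carried by the codec name alone, so the same call works whether the encoded text arrives as a string (for example, from an XML stream) or as bytes.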

For compatibility reasons, the original non-directional codecs Hex and Base64 can still be used:

Hex/Base64 operate as toHex/toBase64 when the target is a ByteArray.
Hex/Base64 operate as fromHex/fromBase64 when the target is a String.
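
The compatibility rule amounts to choosing a directional codec from the target data type. A minimal sketch, in which resolveLegacyCodec and the targetType strings are hypothetical names used only for illustration:

```javascript
// Hypothetical helper: maps a legacy codec name to its directional equivalent
// based on the data type of the conversion target.
function resolveLegacyCodec(codec, targetType) {
  if (codec !== "Hex" && codec !== "Base64") {
    return codec; // already directional, or a regular text codec
  }
  if (targetType === "ByteArray") {
    return "to" + codec; // encoded side ends up in the ByteArray
  }
  if (targetType === "String") {
    return "from" + codec; // decoded side ends up in the String
  }
  throw new Error("unknown target type: " + targetType);
}

console.log(resolveLegacyCodec("Hex", "ByteArray"));  // toHex
console.log(resolveLegacyCodec("Base64", "String"));  // fromBase64
console.log(resolveLegacyCodec("toHex", "String"));   // toHex
```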