Text encoding

JavaScript strings represent Unicode characters (internally encoded in UTF-16). Care must be taken when converting a Unicode string to or from a sequence of 8-bit bytes, such as a file on disk (File class) or a byte array in memory (ByteArray class).

The codecs supported by the scripting API can be classified as follows:

Codecs | Description | Data types
UTF-16, UTF-16BE, UTF-16LE | Map Unicode code points to a 16-bit (two-byte) representation | String <-> ByteArray / File
UTF-8, latin1, Apple Roman, … | Map Unicode code points to an 8-bit (byte-oriented) representation | String <-> ByteArray / File
Hex, Base64 | Map an 8-bit byte stream into a 7-bit ASCII representation | String or ByteArray may be used on both sides
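
To make the difference between the first two codec groups concrete, here is a minimal sketch that encodes the same character with three encodings. It assumes a Node.js environment and uses the Node.js Buffer class as a stand-in for ByteArray; neither Buffer nor the helper name encodedHex is part of the scripting API described here.

```javascript
// Illustration only: Node.js Buffer stands in for a byte array in memory.
// encodedHex is a hypothetical helper, not a function of the utility module.
function encodedHex(text, encoding) {
  return Buffer.from(text, encoding).toString('hex');
}

const sample = '\u00e9'; // LATIN SMALL LETTER E WITH ACUTE

const asUtf16le = encodedHex(sample, 'utf16le'); // 16-bit unit, low byte first
const asUtf8 = encodedHex(sample, 'utf8');       // 8-bit units, two of them here
const asLatin1 = encodedHex(sample, 'latin1');   // exactly one byte

console.log(asUtf16le, asUtf8, asLatin1); // prints "e900 c3a9 e9"
```

The same string therefore yields different byte sequences depending on the codec, which is why the codec has to be chosen deliberately whenever a String crosses over to a ByteArray or File.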

Codec

A codec (short for "coder-decoder") specifies the text encoding used to represent a Unicode string as a sequence of 8-bit bytes. Various codecs are in use throughout the industry.

The utility module refers to supported codecs through predefined names. To specify a codec in a function argument, use one of the following strings:


In addition, the utility module supports the following special codecs (see Hex and Base64 below for more information):


Note:

If the codec argument is missing or null, the codec is set to UTF-8 on Mac and to the current ANSI code page on Windows.

Conversion errors

When converting from a byte sequence to a text string, byte sequences that do not match the requirements for the codec are replaced by the Unicode representation of a question mark ("?").

When converting from a text string to a byte sequence, Unicode code points that cannot be represented as a byte sequence using the codec are replaced by the ASCII representation of a question mark ("?").
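
The two replacement rules can be sketched in plain JavaScript. The helpers below are hypothetical and only imitate the described behavior; they are not the utility module's implementation. The choice of latin1 for encoding and 7-bit ASCII for decoding is an assumption made to keep the example short.

```javascript
const QUESTION_MARK = 0x3f; // ASCII code of "?"

// String -> bytes with a latin1-style codec (hypothetical helper).
// Code points above U+00FF have no latin1 byte and become "?".
function encodeLatin1(text) {
  const bytes = [];
  for (const ch of text) { // iterates by code point, not by UTF-16 code unit
    const codePoint = ch.codePointAt(0);
    bytes.push(codePoint <= 0xff ? codePoint : QUESTION_MARK);
  }
  return Uint8Array.from(bytes);
}

// Bytes -> String with a 7-bit ASCII codec (hypothetical helper).
// Bytes above 0x7F are not valid ASCII and become "?".
function decodeAscii(bytes) {
  let text = '';
  for (const b of bytes) {
    text += b <= 0x7f ? String.fromCharCode(b) : '?';
  }
  return text;
}

console.log(Array.from(encodeLatin1('a\u20acb')));            // [ 97, 63, 98 ]
console.log(decodeAscii(Uint8Array.from([0x48, 0xff, 0x69]))); // H?i
```

In both directions the conversion succeeds silently, so a round trip through an unsuitable codec loses data without raising an error.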

Byte order marker (BOM): Input (from a byte sequence to a text string)

Regardless of the specified codec, the utility module automatically recognizes a UTF-16 BOM and, when one is found, switches to the appropriate UTF-16 variant. For example, if the codec argument is set to "UTF-8", the utility module correctly reads UTF-8, UTF-16BE, and UTF-16LE input.
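
The detection step can be pictured roughly as follows. This is a sketch under assumptions: resolveInputCodec is a hypothetical name, and the utility module's real logic is not documented here. The BOM byte values themselves (FE FF for big-endian, FF FE for little-endian) come from the Unicode standard.

```javascript
// Hypothetical helper: decides which codec to use for reading a byte sequence.
// A leading UTF-16 BOM overrides whatever codec the caller requested.
function resolveInputCodec(bytes, requestedCodec) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return 'UTF-16BE';
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return 'UTF-16LE';
  }
  return requestedCodec; // no UTF-16 BOM found, keep the caller's choice
}

const bigEndian = Uint8Array.from([0xfe, 0xff, 0x00, 0x41]);
const littleEndian = Uint8Array.from([0xff, 0xfe, 0x41, 0x00]);
const plain = Uint8Array.from([0x41]);

console.log(resolveInputCodec(bigEndian, 'UTF-8'));    // UTF-16BE
console.log(resolveInputCodec(littleEndian, 'UTF-8')); // UTF-16LE
console.log(resolveInputCodec(plain, 'UTF-8'));        // UTF-8
```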

Byte order marker (BOM): Output (from a text string to a byte sequence)

When the codec is set to one of the UTF-16 variants, the utility module always outputs an appropriate 2-byte UTF-16 BOM at the beginning of the data.

When the codec is set to UTF-8, the utility module does not output a BOM. This is the standard behavior expected by most applications.

When the codec is set to UTF-8BOM, the utility module outputs a 3-byte UTF-8 BOM at the beginning of the data (this is just a marker to indicate UTF-8 since there are no byte-order issues in this encoding). Be careful when using this codec, because many applications are not able to correctly interpret a UTF-8 BOM.
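
For reference, the BOM bytes involved on output can be tabulated in a small sketch. outputBom is a hypothetical helper, not a utility module function. The plain "UTF-16" codec is deliberately left unresolved because the text above does not state which byte order it selects.

```javascript
// Hypothetical helper: returns the BOM bytes written before the encoded data.
function outputBom(codec) {
  switch (codec) {
    case 'UTF-16BE':
      return [0xfe, 0xff];       // 2-byte BOM, big-endian
    case 'UTF-16LE':
      return [0xff, 0xfe];       // 2-byte BOM, little-endian
    case 'UTF-8BOM':
      return [0xef, 0xbb, 0xbf]; // 3-byte UTF-8 marker
    case 'UTF-16':
      return null;               // a BOM is written, but its byte order is not specified here
    default:
      return [];                 // UTF-8 and the other codecs: no BOM
  }
}

console.log(outputBom('UTF-8').length);    // 0
console.log(outputBom('UTF-8BOM').length); // 3
console.log(outputBom('UTF-16LE'));        // [ 255, 254 ]
```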

Hex and Base64

Converting between a String and a ByteArray with any of the codecs (except Hex and Base64) is always unambiguous: the String contains the Unicode code points, the ByteArray contains the 16-bit or 8-bit representation.

With a Hex or Base64 codec, the situation is problematic because:


To resolve this issue, the 10 release introduces the following directional codecs:

Codec | Description
toHex, toBase64 | Map an 8-bit byte stream into the specified 7-bit ASCII representation, regardless of the source and target data types. If the source is a String, only the 8 lowest bits of each Unicode code point are used (as if latin1 encoding were used).
fromHex, fromBase64 | Map a 7-bit ASCII representation of the specified type into an 8-bit byte stream, regardless of the source and target data types. If the target is a String, only the 8 lowest bits of each Unicode code point are set and higher bits are cleared (as if latin1 encoding were used).
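
The behavior of the directional codecs can be approximated with the sketch below. It assumes a Node.js environment and uses the Node.js Buffer class; the function names mirror the codec names for readability only and are hypothetical, not calls into the utility module.

```javascript
// Hypothetical helpers built on Node.js Buffer (an assumption of this sketch).

// Treats a string source as latin1: only the low 8 bits of each character are used.
function toBytes(source) {
  return typeof source === 'string'
    ? Buffer.from(source, 'latin1')
    : Buffer.from(source);
}

// 8-bit byte stream -> 7-bit ASCII text
function toHex(source) { return toBytes(source).toString('hex'); }
function toBase64(source) { return toBytes(source).toString('base64'); }

// 7-bit ASCII text -> 8-bit byte stream
function fromHex(hexText) { return Buffer.from(hexText, 'hex'); }
function fromBase64(base64Text) { return Buffer.from(base64Text, 'base64'); }

// Byte stream -> String target: each byte fills the low 8 bits of one character.
function bytesToString(bytes) { return Buffer.from(bytes).toString('latin1'); }

console.log(toHex('Hello'));                        // 48656c6c6f
console.log(toBase64('Hello'));                     // SGVsbG8=
console.log(toHex(Uint8Array.from([0x48, 0x69])));  // 4869
console.log(bytesToString(fromHex('48656C6C6F')));  // Hello
console.log(bytesToString(fromBase64('SGVsbG8='))); // Hello
```

Because the direction is part of the codec name, the result is the same whether the source is a String or a byte array.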

For compatibility reasons, the unidirectional codecs Hex and Base64 can still be used: