Text encoding issues

On modern operating systems, including Windows XP and Mac OS X, file and folder names may contain any Unicode code point (except for a few platform-specific separator characters). This holds even when the default 8-bit code page cannot represent these characters. For example, on a regular English Windows XP system a filename can contain Greek or Cyrillic characters, although the default Latin code page cannot represent them.
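
The mismatch between filenames and the default code page can be illustrated with a minimal Python sketch. The filename and the helper name `representable` are made up for illustration, and `cp1252` is assumed here to stand in for the default Latin code page of an English Windows system.

```python
def representable(name: str, encoding: str) -> bool:
    """Return True if every character of `name` exists in `encoding`."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical filename containing Greek letters.
filename = "αβγ.txt"

print(representable(filename, "cp1252"))  # False: not in the Latin code page
print(representable(filename, "utf-8"))   # True: UTF-8 covers all of Unicode
```

A tool that converts such a filename to the default code page at any point loses the Greek characters, which is why the requirements below insist on Unicode throughout.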

Switch is fully Unicode enabled, and for optimal operation it requires a configured third-party application to be Unicode enabled as well. In other words, Switch requires that the third-party application:

Works correctly with filenames that may contain any Unicode code point.
Performs all text communication with Switch using an appropriate and well-defined character encoding (as explained in more detail below).

Command line on Mac OS X

Mac OS X uses UTF-8 as its default encoding for filenames and paths. In our experience, a command line application can simply take a file path from the command line as an 8-bit string and pass it through to a regular UNIX or Mac OS file system call, even if the application itself is not Unicode aware. Because the functions of the Switch Process class use UTF-8 when invoking command line applications, such paths arrive intact and no extra work is needed.
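
The pass-through behavior described above might look like the following Python sketch, which treats the path as an opaque byte string and never decodes it. The function names and the filename are hypothetical, and the sketch assumes a POSIX-style file system that stores names as bytes.

```python
import os
import tempfile


def create_empty_file(raw_path: bytes) -> None:
    """Create a file without ever decoding the path.

    The bytes are handed to the file system call unchanged, much as a
    tool that is not Unicode aware would pass argv[1] straight to fopen().
    """
    with open(raw_path, "wb"):
        pass


def demo() -> bool:
    # Hypothetical UTF-8 encoded name, as it would arrive on the command line.
    raw_name = "αβγ.txt".encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        create_empty_file(os.path.join(os.fsencode(tmp), raw_name))
        # Listing with a bytes path returns the stored names as raw bytes.
        return raw_name in os.listdir(os.fsencode(tmp))
```

Calling `demo()` reports whether the name stored on disk matches the bytes that were passed in, which is the property that makes the pass-through approach work.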

Command line on Windows

On Windows the functions of the Switch Process class use Windows-specific Unicode-enabled function calls to invoke command line applications. The command line application in turn MUST invoke the Windows-specific Unicode-enabled function calls for retrieving the command line AND for opening the files. For example, in C/C++ the command line application must use the Unicode ("wide") version of the Windows-specific "GetCommandLine" function rather than the argv[] argument in the main function (which only supports the current default code page).
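
As a rough illustration of the wide-character calls involved, the sketch below retrieves the command line through `GetCommandLineW` and splits it with `CommandLineToArgvW`, both real Win32 functions, using Python's `ctypes`. The helper name `wide_argv` is an assumption made for this example. Python 3 already populates `sys.argv` from the wide command line on Windows, so the explicit calls are shown only to mirror what a C/C++ tool would need to do. On other platforms the sketch simply falls back to `sys.argv`.

```python
import sys


def wide_argv() -> list:
    """Return the process arguments as Unicode strings."""
    if sys.platform != "win32":
        # No wide-character command line API here; use sys.argv as-is.
        return list(sys.argv)

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    shell32 = ctypes.WinDLL("shell32", use_last_error=True)

    kernel32.GetCommandLineW.argtypes = []
    kernel32.GetCommandLineW.restype = wintypes.LPCWSTR
    shell32.CommandLineToArgvW.argtypes = [
        wintypes.LPCWSTR,
        ctypes.POINTER(ctypes.c_int),
    ]
    shell32.CommandLineToArgvW.restype = ctypes.POINTER(wintypes.LPWSTR)
    kernel32.LocalFree.argtypes = [wintypes.HLOCAL]
    kernel32.LocalFree.restype = wintypes.HLOCAL

    argc = ctypes.c_int(0)
    argv = shell32.CommandLineToArgvW(
        kernel32.GetCommandLineW(), ctypes.byref(argc)
    )
    if not argv:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        # Each entry is a UTF-16 string converted to a Python str.
        return [argv[i] for i in range(argc.value)]
    finally:
        # CommandLineToArgvW allocates one block that the caller must free.
        kernel32.LocalFree(argv)
```

The same two calls are available from C/C++, where the resulting wide strings should then be passed to the wide versions of the file-opening functions.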

Other text input/output

This category covers the console input/output streams and any exchanged control files that may contain plain text (as opposed to XML, which has built-in Unicode support).

It is strongly recommended to use UTF-8 for all text input/output because this encoding can represent any Unicode code point and is backward compatible with 7-bit ASCII in various ways (for example, with regard to line breaks and null terminators).
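
The following Python sketch shows one way to emit UTF-8 on an output stream regardless of the locale, together with a check of the ASCII compatibility mentioned above. The function names are hypothetical and the message text is made up.

```python
import io
import sys


def write_utf8_line(stream, text: str) -> bytes:
    """Write `text` plus a line break to a binary stream as UTF-8."""
    data = (text + "\n").encode("utf-8")
    stream.write(data)
    return data


def emit(text: str) -> None:
    """Send one UTF-8 line to the console, bypassing the locale encoding."""
    write_utf8_line(sys.stdout.buffer, text)


def ascii_safe(text: str) -> bool:
    """Check that bytes below 0x80 occur only for ASCII characters.

    This is why a line break or null byte found in UTF-8 data is always
    a real line break or null terminator, never part of another character.
    """
    ascii_chars = sum(1 for ch in text if ord(ch) < 0x80)
    ascii_bytes = sum(1 for b in text.encode("utf-8") if b < 0x80)
    return ascii_chars == ascii_bytes


buffer = io.BytesIO()
write_utf8_line(buffer, "αβγ")
print(buffer.getvalue())  # b'\xce\xb1\xce\xb2\xce\xb3\n'
```

Writing bytes to the underlying binary stream, as `emit` does, avoids depending on whatever encoding the console happens to be configured with.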

If this is not feasible, at the very least the encoding used must be well-defined and documented, and it should follow these guidelines:

Text that may contain a filename/path must be able to represent any Unicode code point. In essence, UTF-8 and UTF-16 are the only options.
Text that may contain code points outside 7-bit ASCII (for example, localized messages for human readers) must be in a well-defined encoding that is able to represent the characters in the languages used. Preferably this encoding is the same on all platforms; at the very least it must be well defined for each platform. Again, UTF-8 is the best option.
Text that only contains 7-bit ASCII code points (for example, keywords that are not localized and are not intended for human consumption) is not encoding-sensitive. Still, using UTF-8 does no harm because it is compatible with 7-bit ASCII.
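
A plain-text control file that follows these guidelines could be written and read as in the sketch below. The file name `control.txt`, the `key=value` line layout, and the function names are assumptions made for this example only; the actual control file format depends on the third-party application.

```python
import os
import tempfile


def write_control_file(path: str, lines: list) -> None:
    """Write text lines using an explicit, documented encoding (UTF-8)."""
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")


def read_control_file(path: str) -> list:
    """Read the file back; invalid UTF-8 raises UnicodeDecodeError."""
    with open(path, "r", encoding="utf-8", errors="strict",
              newline="\n") as handle:
        return [line.rstrip("\n") for line in handle]


def round_trip(lines: list) -> list:
    """Write and re-read a hypothetical control file in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "control.txt")
        write_control_file(path, lines)
        return read_control_file(path)
```

Naming the encoding explicitly on both sides, rather than relying on the platform default, is what makes the exchange well defined: a path, a localized message, and a plain ASCII keyword all survive the round trip unchanged.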