Internationalization

In an architecture where a program executed on a server provides its data to multiple clients, the server’s locale conventions are not necessarily the same as those of the clients. Conventions may differ in language, but also in the way numeric values are written (even between two countries that speak the same language), as well as in time zone. To produce messages that conform to the client’s conventions, SIS uses two approaches, distinguished by their level of granularity: at the level of the messages themselves, or at the level of the objects that create the messages. The approach used also determines whether it is possible to share the same instance of an object for all languages.

Distinct character sequences for each locale

Some classes are designed to function according to only one locale convention at a time. This is of course true for the standard implementations of java.text.Format, as they are entirely dedicated to the work of internationalization. But it is also the case for other less obvious classes like javax.imageio.ImageReader and ImageWriter. When one of these classes is implemented by SIS, we identify it by implementing the org.apache.sis.util.Localized interface. The getLocale() method of this interface returns the locale whose conventions the instance uses to produce its messages.

Another class that provides different methods for different locales is java.lang.Throwable. The standard Java API defines two methods for getting the error message: getMessage() and getLocalizedMessage(). Usually those two methods return the same character sequences, but some exceptions thrown by Apache SIS may use different locales. The policy that SIS tries to apply on a best-effort basis is:

getMessage() returns the message in the JVM default locale. In a client-server architecture, this is often the locale on the server side. This is the recommended language for logging messages to be read by system administrators.
getLocalizedMessage() returns the message in a locale that depends on the context in which the exception has been thrown. This is often the locale used by a particular Format or DataStore instance, and can be presumed to be the locale on the client side. This is the recommended language to show in the user application.

Example: If an error occurs while a Japanese client is connected to a European server, the localized message may be sent to the client in Japanese, as given by getLocalizedMessage(), while the same error may be logged on the server side in French (for example), as given by getMessage(). This allows system administrators to analyze the issue without needing to understand the client’s language.

The utility class org.apache.sis.util.Exceptions provides convenience methods to get messages according to the conventions of a given locale, when this information is available.
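The getMessage() / getLocalizedMessage() policy described above can be sketched with standard Java only. The class below is hypothetical (it is not part of Apache SIS), its message table is an illustrative assumption, and English stands in for the client language to keep the source in ASCII:

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical exception, for illustration only; not a class of Apache SIS. */
class LocalizedDemoException extends Exception {
    // Illustrative message table; a real application would use resource bundles.
    private static final Map<Locale, String> MESSAGES = Map.of(
            Locale.FRENCH,  "Fichier introuvable.",
            Locale.ENGLISH, "File not found.");

    private final Locale clientLocale;

    LocalizedDemoException(Locale serverLocale, Locale clientLocale) {
        super(MESSAGES.get(serverLocale));      // Text intended for the server logs.
        this.clientLocale = clientLocale;
    }

    /** Returns the message in the locale of the client. */
    @Override
    public String getLocalizedMessage() {
        return MESSAGES.get(clientLocale);
    }

    public static void main(String[] args) {
        Exception e = new LocalizedDemoException(Locale.FRENCH, Locale.ENGLISH);
        System.out.println("Log:    " + e.getMessage());            // Server-side text.
        System.out.println("Client: " + e.getLocalizedMessage());   // Client-side text.
    }
}
```

Note that Throwable.toString() is built from getLocalizedMessage(), so a stack trace printed without further care shows the client-side text rather than the server-side one.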

Single instance for all supported locales

The API conventions defined by SIS or inherited from GeoAPI favour the use of the InternationalString type wherever a String value would likely be localized. This approach allows us to defer the internationalization process to the time when a character sequence is requested, rather than the time when the object that contains it is created. This is particularly useful for immutable classes used for creating unique instances independently of locale conventions.

Example: SIS includes only one instance of the OperationMethod type representing the Mercator projection, regardless of the client’s language. But its getName() method (indirectly) provides an instance of InternationalString, so that toString(Locale.ENGLISH) returns Mercator projection while toString(Locale.FRENCH) returns Projection de Mercator.
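The idea of deferring localization until the text is requested can be illustrated with a minimal stand-in. The class below is hypothetical and much simpler than the InternationalString implementations of GeoAPI and SIS; the fallback on Locale.ROOT and the text stored under that key are assumptions made for this sketch:

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for an international string; not the GeoAPI or SIS type. */
final class SimpleInternationalString {
    private final Map<Locale, String> texts;

    SimpleInternationalString(Map<Locale, String> texts) {
        this.texts = Map.copyOf(texts);     // Immutable, so one instance can be shared.
    }

    /** Returns the text for exactly the given locale, falling back on Locale.ROOT. */
    String toString(Locale locale) {
        String text = texts.get(locale);
        return (text != null) ? text : texts.get(Locale.ROOT);
    }

    /** Creates the single, locale-independent name used in this example. */
    static SimpleInternationalString mercatorName() {
        return new SimpleInternationalString(Map.of(
                Locale.ROOT,    "Mercator",                 // Assumed unlocalized text.
                Locale.ENGLISH, "Mercator projection",
                Locale.FRENCH,  "Projection de Mercator"));
    }

    public static void main(String[] args) {
        SimpleInternationalString name = mercatorName();
        System.out.println(name.toString(Locale.ENGLISH));
        System.out.println(name.toString(Locale.FRENCH));
    }
}
```

The same instance serves every client; the locale is chosen only when toString(Locale) is invoked.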

By defining spatial objects independently of locale conventions, we reduce the risk of computational overload. For example, it is easier to detect that two maps use the same cartographic projection if the latter is represented by the same instance of CoordinateOperation, even if the projection has a different name depending on the country. Moreover, certain types of CoordinateOperation may require coordinate transformation matrices, so sharing a single instance becomes even more preferable in order to reduce memory consumption.

Locale.ROOT convention

All SIS methods receiving or returning a value of type Locale accept the Locale.ROOT value. This value is interpreted as a request not to localize the text. The notion of non-localized text is somewhat misleading, as it is always necessary to choose a formatting convention. The convention chosen, though very close to English, usually differs from it slightly. For example:

Identifiers are written as they appear in UML diagrams, such as blurredImage instead of Blurred image.
Dates are written according to the ISO 8601 format, which does not correspond to English conventions.
Numbers are written using their toString() methods, rather than using a java.text.NumberFormat. As a result, there are differences in the number of significant digits, use of exponential notation and the absence of thousands separators.
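The differences listed above for numbers and dates can be observed with the standard Java library alone. The following is a small sketch, not SIS code; the class name and the sample values are arbitrary choices made for this illustration:

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

/** Hypothetical demo class comparing unlocalized and English formatting. */
class UnlocalizedFormatDemo {
    /** Formats the number without localization, using Double.toString(double). */
    static String unlocalized(double value) {
        return Double.toString(value);
    }

    /** Formats the number according to English conventions. */
    static String english(double value) {
        return NumberFormat.getNumberInstance(Locale.ENGLISH).format(value);
    }

    public static void main(String[] args) {
        System.out.println(unlocalized(1234567.5));   // 1234567.5 (no thousands separator)
        System.out.println(english(1234567.5));       // 1,234,567.5
        System.out.println(unlocalized(1.5E10));      // 1.5E10 (exponential notation)
        System.out.println(english(1.5E10));          // 15,000,000,000

        LocalDate date = LocalDate.of(2024, 3, 5);
        System.out.println(date);                     // ISO 8601: 2024-03-05
        // English conventions, for comparison with the ISO 8601 form above.
        System.out.println(date.format(
                DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG).withLocale(Locale.ENGLISH)));
    }
}
```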

Treatment of characters

In Java, sequences of characters use UTF-16 encoding. There is a direct correspondence between the values of the char type and the great majority of characters, which makes such sequences easy to use as long as these characters are sufficient. However, certain Unicode characters cannot be represented by a single char. These supplementary characters include certain ideograms, but also road and geographical symbols in the 1F680 to 1F700 range. Supporting them requires slightly more complex iteration than the classic case, where a direct correspondence may be assumed. Thus, instead of the first loop below, international applications must generally use the second:

Loop to Avoid

for (int i=0; i<string.length(); i++) {
    char c = string.charAt(i);
    if (Character.isWhitespace(c)) {
        // A blank space was found.
    }
}

Recommended loop

for (int i=0; i<string.length();) {
    int c = string.codePointAt(i);
    if (Character.isWhitespace(c)) {
        // A blank space was found.
    }
    i += Character.charCount(c);
}

Supplementary character examples

(rendering depends on browser capabilities)

🚉 🚥 🚧 🚫 🚯 🚸 🚺 🚹 🛄 🚭

SIS supports supplementary characters by using the recommended loop where necessary. The simpler loop is still used occasionally, when it is known that the characters searched for are not supplementary characters, even if some may be present in the sequence being searched.
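To see a case where the two loops give different results, the sketch below counts the symbols in a text, first char by char and then code point by code point. The class and method names are hypothetical, and the use of Character.getType(…) as the test is a choice made for this illustration (the whitespace test of the loops above never matches a surrogate, so it would not show the difference):

```java
/** Hypothetical demo class comparing iteration over chars and over code points. */
class CodePointLoopDemo {
    /** Counts symbols by iterating over char values; misses supplementary characters. */
    static int countSymbolsByChar(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Each half of a surrogate pair has the SURROGATE type, never OTHER_SYMBOL.
            if (Character.getType(c) == Character.OTHER_SYMBOL) count++;
        }
        return count;
    }

    /** Counts symbols by iterating over code points; handles supplementary characters. */
    static int countSymbolsByCodePoint(String text) {
        int count = 0;
        for (int i = 0; i < text.length();) {
            int c = text.codePointAt(i);
            if (Character.getType(c) == Character.OTHER_SYMBOL) count++;
            i += Character.charCount(c);    // Advance by 1 or 2 char values.
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1F689 (station symbol) is encoded as two char values in UTF-16.
        String text = "Station " + new String(Character.toChars(0x1F689));
        System.out.println(text.length());                   // 10 char values
        System.out.println(countSymbolsByChar(text));        // 0
        System.out.println(countSymbolsByCodePoint(text));   // 1
    }
}
```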

Blank spaces interpretation

Standard Java provides two methods for determining whether a character is a blank space: Character.isWhitespace(…) and Character.isSpaceChar(…). These two methods differ in their interpretations of non-breaking spaces, tabs and line breaks. The first method conforms to the interpretation currently used in languages such as Java, C/C++ and XML, which considers tabs and line breaks to be blank spaces, while non-breaking spaces are read as not blank. The second method — which conforms strictly to the Unicode definition — makes the opposite interpretation.

SIS uses each of these methods in different contexts. isWhitespace(…) is used to separate the elements of a list (numbers, dates, words, etc.), while isSpaceChar(…) is used to ignore blank spaces inside a single element.

Example: Take a list of numbers represented according to French conventions. Each number may contain non-breaking spaces as thousands separators, while the different numbers in the list may be separated by ordinary spaces, tabs or line breaks. When analyzing a number, we want to consider the non-breaking spaces as being part of the number, whereas a tab or a line break most likely indicates a separation between this number and the next. We would thus use isSpaceChar(…). Conversely, when separating the numbers in the list, we want to consider tabs and line breaks as separators, but not non-breaking spaces. We would thus use isWhitespace(…). The role of ordinary spaces, to which either case might apply, should be decided beforehand.

In practice, this distinction is reflected in the use of isSpaceChar(…) in the implementations of java.text.Format, and the use of isWhitespace(…) in nearly all the rest of the SIS library.
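The example of the list of numbers can be sketched as follows. This is a simplified illustration, not the SIS implementation: the class name is hypothetical, only integers are handled, and ordinary spaces are treated as list separators, which is one of the two possible decisions mentioned above:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for lists of integers with spaces as thousands separators. */
class NumberListDemo {
    /** Splits the text into integers, ignoring non-breaking spaces inside a number. */
    static List<Integer> parse(String text) {
        List<Integer> values = new ArrayList<>();
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < text.length();) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (Character.isWhitespace(c)) {
                // Tab, line break or ordinary space: end of the current number.
                flush(digits, values);
            } else if (!Character.isSpaceChar(c)) {
                digits.appendCodePoint(c);
            }
            // Otherwise: a non-breaking space inside a number, which is ignored.
        }
        flush(digits, values);
        return values;
    }

    /** Converts the pending digits (if any) to a number and clears the buffer. */
    private static void flush(StringBuilder digits, List<Integer> values) {
        if (digits.length() != 0) {
            values.add(Integer.valueOf(digits.toString()));
            digits.setLength(0);
        }
    }

    public static void main(String[] args) {
        // \u00A0 is the non-breaking space used here as thousands separator.
        System.out.println(parse("1\u00A0234 5\u00A0678\t90\n12"));   // [1234, 5678, 90, 12]
    }
}
```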
