content/developer-guide/utility/Internationalization.html (184 lines of code) (raw):
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Internationalization</title>
<meta charset="UTF-8"/>
<link rel="stylesheet" type="text/css" href="../book.css"/>
</head>
<body>
<!--
Content below this point is copied in "/asf-staging/book/en/developer-guide.html" file
by the `org.apache.sis.buildtools.book` class in `buildSrc`.
-->
<section>
<header>
<h2 id="Internationalization">Internationalization</h2>
</header>
<p>
In an architecture where a program executed on a server provides its data to multiple clients,
the server’s locale conventions are not necessarily the same as those of the clients.
Conventions may differ in language, but also in the way they write numeric values
(even between two countries that speak the same language) as well in time zone.
To produce messages that conform to the client’s conventions, <abbr>SIS</abbr> uses
two approaches, distinguished by their level of granularity: at the level of the messages themselves,
or at the level of the objects that create the messages.
The approach used also determines whether it is possible to share the same instance of an object for all languages.
</p>
<h3 id="LocalizedString">Distinct character sequences for each locale</h3>
<p>
Some classes are only designed to function according to one locale convention at a time.
This is of course true for the standard implementations of <code>java.text.Format</code>,
as they are entirely dedicated to the work of internationalization.
But it is also the case for other less obvious classes like <code>javax.imageio.ImageReader</code> and <code>ImageWriter</code>.
When one of these classes is implemented by <abbr>SIS</abbr>,
we identify it by implementing the <code>org.apache.sis.util.Localized</code> interface.
The <code class="SIS">getLocale()</code> method of this interface can determine the locale conventions
by which the instance produces its message.
</p>
<p>
Another class that provides different methods for different locales is <code>java.lang.Throwable</code>.
The standard Java <abbr>API</abbr> defines two methods for getting the error message:
<code>getMessage()</code> and <code>getLocalizedMessage()</code>.
Usually those two methods return the same character sequences,
but some exceptions thrown by Apache <abbr>SIS</abbr> may use different locales.
The policy that <abbr>SIS</abbr> tries to apply on a <em>best-effort</em> basis is:
</p>
<ul>
<li><code>getMessage()</code> returns the message in the <abbr title="Java Virtual Machine">JVM</abbr> default locale.
In a client-server architecture, this is often the locale on the server side.
This is the recommended language for logging messages to be read by system administrators.</li>
<li><code>getLocalizedMessage()</code> returns the message in a locale that depends on the context
in which the exception has been thrown. This is often the locale used by a particular <code>Format</code>
or <code>DataStore</code> instance, and can be presumed to be the locale on the client side.
This is the recommended language to show in the user application.</li>
</ul>
<div class="example"><p><b>Example:</b>
If an error occurred while a Japanese client connected to an European server, the localized message may be sent
to the client in Japanese language as given by <code>getLocalizedMessage()</code> while the same error may be
logged on the server side in the French (for example) language as given by <code>getMessage()</code>.
This allows system administrator to analyze the issue without the need to understand client’s language.
</p></div>
<p>
The utility class <code>org.apache.sis.util.Exceptions</code> provides convenience methods to get messages
according to the conventions of a given locale, when this information is available.
</p>
<h3 id="InternationalString">Single instance for all supported locales</h3>
<p>
The <abbr>API</abbr> conventions defined by <abbr>SIS</abbr> or inherited by GeoAPI favour the use of the
<code>InternationalString</code> type when the value of a <code>String</code> type would likely be localized.
This approach allows us to defer the internationalization process to the time when a character sequence is requested,
rather than the time when the object that contains them is created.
This is particularly useful for immutable classes used for creating unique instances independently of locale conventions.
</p>
<div class="example"><p><b>Example:</b>
<abbr>SIS</abbr> includes only one instance of the <code>OperationMethod</code>
type representing the Mercator projection, regardless of the client’s language.
But its <code class="API">getName()</code> method (indirectly) provides an instance of
<code>InternationalString</code>, so that <code>toString(Locale.ENGLISH)</code> returns <cite>Mercator projection</cite>
while <code>toString(Locale.FRENCH)</code> returns <cite>Projection de Mercator</cite>.
</p></div>
<p>
When defining spatial objects independently of locale conventions, we reduce the risk of computational overload.
For example, it is easier to detect that two maps use the same cartographic projection if this last is represented by the
same instance of <code>CoordinateOperation</code>,
even if the projection has a different name depending on the country.
Moreover, certain types of <code>CoordinateOperation</code> may require coordinate transformation matrices,
so sharing a single instance becomes even more preferable in order to reduce memory consumption.
</p>
<h3 id="Locale.ROOT"><code>Locale.ROOT</code> convention</h3>
<p>
All <abbr>SIS</abbr> methods receiving or returning the value of a <code>Locale</code> type accept the <code>Locale.ROOT</code> value.
This value is interpreted as specifying not to localize the text.
The notion of a <dfn>non-localized text</dfn> is a little false, as it is always necessary to chose a formatting convention.
This convention however, though very close to English, is usually slightly different.
For example:
</p>
<ul>
<li>
Identifiers are written as they appear in <abbr>UML</abbr> diagrams,
such as <var>blurredImage</var> instead of <var>Blurred image</var>.
</li>
<li>
Dates are written according to the <abbr>ISO</abbr> 8601 format,
which does not correspond to English conventions.
</li>
<li>
Numbers are written using their <code>toString()</code> methods, rather than using a <code>java.text.NumberFormat</code>.
As a result, there are differences in the number of significant digits,
use of exponential notation and the absence of thousands separators.
</li>
</ul>
<h3 id="UnicodePoint">Treatment of characters</h3>
<p>
In Java, sequences of characters use UTF-16 encoding.
There is a direct correspondence between the values of the <code>char</code> type and the great majority of characters,
which facilitates the use of sequences so long as these characters are sufficient.
However, certain Unicode characters cannot be represented by a single <code>char</code>.
These <i>supplementary characters</i> include certain ideograms,
but also road and geographical symbols in the 1F680 to 1F700 range.
Support for these supplementary characters requires slightly more complex interactions than the classic case,
where we may assume a direct correspondence.
Thus, instead of the loop on the left below, international applications must generally use the loop on the right:
</p>
<div class="row-of-boxes">
<div>
<div class="caption">Loop to Avoid</div>
<pre style="margin-top: 6pt"><code>for (int i=0; i<string.length(); i++) {
char c = string.charAt(i);
if (Character.isWhitespace(c)) {
// A blank space was found.
}
}</code></pre>
</div>
<div>
<div class="caption">Recommended loop</div>
<pre style="margin-top: 6pt"><code>for (int i=0; i<string.length();) {
int c = string.codePointAt(i);
if (Character.isWhitespace(c)) {
// A blank space was found.
}
i += Character.charCount(c);
}</code></pre>
</div>
<div>
<div class="caption">Supplementary character examples</div>
<center>(rendering depends on browser capabilities)</center>
<p style="font-size: 40px">🚉 🚥 🚧 🚫
🚯 🚸 🚺 🚹 🛄 🚭</p>
</div>
</div>
<p>
<abbr>SIS</abbr> supports supplementary characters by using the loop on the right where necessary,
but the loop on the left is occasionally used when it is known that the characters searched for are not supplementary characters,
even if some may be present in the sequence in which we are searching.
</p>
<h4 id="Whitespaces">Blank spaces interpretation</h4>
<p>
Standard Java provides two methods for determining whether a character is a blank space:
<code>Character.isWhitespace(…)</code> and <code>Character.isSpaceChar(…)</code>.
These two methods differ in their interpretations of non-breaking spaces, tabs and line breaks.
The first method conforms to the interpretation currently used in languages such as Java, C/C++ and <abbr>XML</abbr>,
which considers tabs and line breaks to be blank spaces, while non-breaking spaces are read as not blank.
The second method — which conforms strictly to the Unicode definition — makes the opposite interpretation.
</p>
<p>
<abbr>SIS</abbr> uses each of these methods in different contexts.
<code>isWhitespace(…)</code> is used to <em>separate</em> the elements of a list (numbers, dates, words, etc.),
while <code>isSpaceChar(…)</code> is used to ignore blank spaces <em>inside</em> a single element.
</p>
<div class="example"><p><b>Example:</b>
Take a list of numbers represented according to French conventions.
Each number may contain <em>non-breaking spaces</em> as thousands separators,
while the different numbers in the list may be separated by ordinary spaces, tabs or line breaks.
When analyzing a number, we want to consider the non-breaking spaces as being part of the number,
whereas a tab or a line break most likely indicates a separation between this number and the next.
We would thus use <code>isSpaceChar(…)</code>.
Conversely, when separating the numbers in the list, we want to consider tabs and line breaks as separators,
but not non-breaking spaces.
We would thus use <code>isWhitespace(…)</code>.
The role of ordinary spaces, to which either case might apply, should be decided beforehand.
</p></div>
<p>
In practice, this distinction is reflected in the use of <code>isSpaceChar(…)</code> in the implementations of <code>java.text.Format</code>,
or the use of <code>isWhitespace(…)</code> in nearly all the rest of the <abbr>SIS</abbr> library.
</p>
</section>
</body>
</html>