Vladimir Grigorian - Globalization Support

You use Globalization support to store and retrieve data in native language and format. Oracle supports a wide variety of languages and character sets. Globalization support lets you communicate with end users in their native language using their familiar date formats, number formats and sorting sequence. Oracle uses Unicode (a worldwide encoding standard for computing usage) to support the languages.

You define the database character set when you create the database using the CHARACTER SET clause of the CREATE DATABASE statement. The character set stores table names, column names and so on in the dictionary, stores PL/SQL in memory, etc. If you do not specify the character set Oracle uses the US7ASCII character set. US7ASCII is a 7 bit ASCII character set that uses a single byte to store a character set, and it represent 128 characters.

Other widely used single-byte character sets are WE8ISO8859P1 and UTF8. These character sets use 8 bits to represent a character and can represent 256 characters. Oracle also supports multi byte character encoding. Multi byte character encoding is used to represent such languages as Chinese, Japanese, etc. Multi byte encoding schemes can be fixed width encoding schemes or variable width encoding schemes. In a variable width encoding scheme certain characters are represented using 1 byte, and two or more bytes are used to represent other characters.

You can change the database character set only if the new character set is a superset of the current character set.

Oracle lets choose an additional character set that enhances the character processing capabilities. You can specify the second character set when you create the database using the NATIONAL CHARACTER SET clause. If you do not specify NATIONAL CHARACTER SET, Oracle uses the Unicode character set AF16UTF16. The national character set stores data in NCHAR, NVARCHAR2, and NCLOB data type columns.

The national character set can be either AF16UTF1 or UTF8. AF16UTF16 is the default. AF16UTF16 and UTF8 are Unicode character sets. You can not specify AF16UTF16 as the database character set. When choosing a multi byte character set for your database remember that by default the VARCHAR2, CHAR, NVARCHAR2 and NCHAR data types specify the maximum length in bytes, not in characters. You can change this default behavior by setting the NLS_LENGTH_SEMANTICS=CHAR or by providing the semantic information along the column definition (as in VARCHAR2(20 CHAR)).

The client machine can specify a character set different from the database character set by using local environment variables. The database character set should be a superset to the client character set. Oracle converts the character set automatically, but there is some overhead associated with the conversion.

Certain character sets support multiple languages. For example, WE8ISO8859P1 can support all European languages.

The Unicode character set

Unicode is a universal character encoding scheme that allows you to store information from any major language using a single character set. Unicode provides a unique code value for every character, regardless of the platform, language or program. Unicode has both 16-bit and 8-bit encoding.

UTF-16 is the 16 bit encoding of Unicode. It is a fixed width multi byte encoding in which the character codes 0x00 through 0x7F have the same meaning as they do in ASCII. One Unicode character is 2 bytes in this encoding. Characters from both European and Asian scripts are presented in 2 bytes. AF16UTF16 is UTF-16 encoded character set.

UTF-8 is the 8-bit encoding of Unicode. It is a variable-width multi byte encoding in which the character set. One unicode character can be 1 byte, 2 bytes or 3 bytes in this encoding.

Using NLS parameters

You can customize the NLS parameters in the following ways:

- By specifying in the initialization file, which will be used at the instance startup (NLS_DATE_FORMAT = “”YYYY-MM-DD”).

- By setting the parameter as an environment variable (csh: setenv NLS_DATE_FORMAT YYYY-MM-DD or using Microsft Windows Registry.

- By setting the parameter in the Oracle session using ALTER SESSION SET NLS_DATEFORMAT = “YYYY-MM-DD”

- By using certain SQL functions (TO_CHAR (SYSDATE, ‘YYYY-MM-DD’, ‘NLS_DATE_LANGUAGE=AMERICAN’)).

The parameter specified in SQL functions have the highest priority, the next highest is the parameter ALTER SESSION, then environment variable, init parameters, and finally the database parameters.

You can not change certain parameters using ALTER SESSION and you can not specify certain parameters as environment variables.

NLS_LENGTH_SEMANTICS specified at the session level or as init parameter. Defines the character length semantics as byte or character. The default is BYTE. NLS_LENGTH_SEMANTICS does nto apply to tables on SYS and SYTEM they are always in BYTE semantic.

NLS_LANG specified only as environment variable. Has 3 parts – the language, the territory and the character set. None of the parts are mandatory. The format to specify NLS_LANG is <LANGUAGE>_<TERRITORY>.<CHARACTER SET>. The language specifies the language be used for displaying Oracle messages, day names, month names, etc. The territory specifies the default date format, numeric formats, and monetary formats. The character set specifies the character set to be used by the client machine – for example – AMERICAN_AMERICA.WE8ISO8859P1.

NLS_LANGUAGE specified at the session level or as init parameter. Sets the language to be used. The session value overrides the NLS_LANG setting. The default values for parameters such as NLS_CURRENCY, NLS_ISO_CURRENCY, NLS_DATE_FORMAT, and NLS_NUMERIC_CHARACTERS are derived from NLS_TERRITORY.

NLS_DATE_FORMAT – specified at the session level or as environment variable, or as init parameter. Sets default format for date displays.

NLS_DATE_LANGUAGE - specified at the session level or as environment variable, or as init parameter. Sets language explicitly date and month names.

NLS_TIMESTAMP_FORMAT - specified at the session level or as environment variable, or as init parameter. Sets default format for date displays.

NLS_CALENDAR – specified at the session level, or as environment variable, or as init parameter. Sets the calendar Oracle uses.

NLS_NUMERIC_CHARACTERS – specified at the session level, or as environment variable, or as an init parameter. Specifies the decimal character and group separator.

NLS_CURRENCY – specified at the session level, or as environment variable or as init parameter. Specifies a currency symbol.

NLS_SO_CURRENCY – specifies the ISO currency symbol.

NLS_DUAL_CURRENCY – specifies an alternative currency symbol.

NLS_SORT – language for sorting.

Obtaining NLS data dictionary information

NLS_DATABASE_PARAMETERS – shows NLS parameters defined for the database.

NLS_INSTANCE_PARAMETERS - NLS parameters defined for the instance.

NLS_SESSION_PARAMETERS - NLS parameters defined for the current session.

V$NLS_VALID_VALUES – shows the allowed values for the language territory and character set definitions.

The following examples show NLS information from the data dictionary view and examples of changing session NLS values.

SQL> SELECT * FROM NLS_DATABASE_PARAMETERS;

Restrictions of Globalization support

- Keywords are displayed only in English

- No fixed-width multi-byte character sets can be used as database character sets

- Variable-width character sets sue space more efficiently.

Display of the characters on the screen

Database character sets

Character sets defined at the time of database creation

Data columns stored with data type CHAR, VARCHAR2, CLOB and LONG

Store varying width multi-byte character sets

National character sets

Character sets defined at the timke of database creation

Data columns stored with data type NCHAR, NVARCHAR2 and NCLOB

Store varying width multi byte character sets

Store fixed-width multi byte character sets

If you have string operations, the string operations might be faster if you select the fixed width multi byte character set to represent the national character set.

Variable width character sets can save storage space.

Globalization support parameters

Init parameters – does not affect client

Environment variables

Alter session

The init parameters that specify the default language dependent behavior are NLS_LANGUAGE and NLS_TERRITORY.

You can use the NLS_LANGUAGE parameter to specify the value for the language-dependent session characteristics. One of the language-dependent session characteristics is the display of the error messages. The server messages are stored in binary frmat files. Multiple versions of these files exist, one ofr each supported language. The files use specific naming conventions. For example, the name of the file storing server messages in French is ORAF.MSB. Another rsession characteristic is the language used for the display of day and month names and their abbreviations. Language dependent session characteristics also include symbols AM, PM, BC. Etc. The language dependent session characteristics also include default sorting mechanism.

The NLS_TERRITORY parameter specifies the conventions for the default numeric formatting characteristics such as date format, local currency symbol, ISO currency symbol. Some other default conventions that are specified by the NLS_TERRITORY parameter are decimal character and group separator, the list separator the week start day, the credit and debit symbols and the ISO week number calculation.

Both NLS_LANGUAGE and NLS_TERRITORY can be specified either as init parameters or as ALTER SESSION parameters, but they can not be specified as environment variables.

Derived globalization support parameters

The derived parameters of NLS_LANGUAGE are NLS_DATE_LANGUAGE (language for the day and month names) and NLS_SORT (linguistic sort). Both override NLS_LANGAUGE.

The derived parameters of NLS_TERRITORY are NLS_CURRENCY (, NLS_ISO_CURRENCY, NLS_DATE_FORMAT (“MM/DD/YYYY”), and NLS_NUMERIC_CHARACTERS.

Overriding default globalization support behavior

The NLS_LANG environment variable has 3 components – language, territory and charset NLS_LANG=<language>_<territory>.<charset>

Language dependent behavior of a session

An implicit ALTER SESSION command is normally executed when a session connects to a database. Users can also use the ALTER SESSION command during a session to change the globalization support behavior. If NLS_LANG is not defined, the implicit ALTER SESSION is not executed.

Globalization and sorting

A conventional sorting mechanism is the binary sort. The linguistic sorting of multi-byte character sets is not supported. Therefore, if the database character set is multi-byte the result of any sort is binary.

The NLS_COMP init parameter controls the method used by the comparison operators such as <, >, etc. When this parameter is set to NASI(, the values are compared by the binary value of the string. When the parameter is set to ANSI the values are compared by the linguistic sort sequence.

SQL functions

Some SQL functions can use globalization support features – TO_CHAR, TO_DATE, TO-NUMBER, NLS_UPPER, NLS_LOWER, NLS_INITCAP, NLSSORT.

The parameters that can be used with SQL functions are NLS_DATE_LANGUAGE, NLS_CALENDAR, NLS_NUMERIC_CHARACTERS, NLS_CURRENCY, NLS_ISO_CURRENCY and NLS_SORT.

TO_DATE – NLS_DATE_LANGUAGE, NLS_CALENDAR

TO_NUMBER – NLS_NUMERIC_CHARACTER, NLS_CURRENCY, NLS_ISO_CURRENCY

TO_CHAR – NLS_DATE_LANGUAGE, NLS_NUMERIC_CHARACTERS, NLS_CURRENCY, NLS_ISO_CURRENCY, NLS_CALENDAR

NLS_UPPER, NLS_LOWER, NLS_INITCAP, NLS_SORT – NLS_SORT

The following globalization parameters are NOT accepted in SQL functions – NLS_LANGUAGE, NLS_TERRITORY, NLS_DATE_FORMAT.

Character migration

When you migrate from a single byte character set to a multi byte variable width character set some characters transalate into 2 byte characters. Therefore, these characters may not fit into the existing columns – the data is truncated. During the data scanning 3 tests are performed to ensure that the conversion is OK.

Do the data cell codes change when converted?
can the data cells be successfully converted?
Does the post-conversion data fit into the current column width?

The maximum size of the LONG and LOB data types are 2Gb and 4GB respectively.

Scanner process invocation steps

Prepare the scanner – CSMNST.SQL
Invoke the scanner - $csssscan STSREM/MANAGER FULL = y TOCHAR = utf8 ARRAy = 10240

Obtaining information on character sets

SQL> SELECT parameter, value from nls_databse_parameters

Where parameter lile ‘%characterset%’;

The views that contain information about the globalization support are:

NLS_DATABASE_PARAMETERS, NLS_INSTANCE_PARAMETRS, NLS_SESSION_PARAMETERS, V$NLS_VALID_VALUES, V$NLS_PARAMETERS (contains real time instance specific information)..

To retrieve information about the supported languages, you can issue the select statement displayed on the screen –

SQL> SELECT * FROM V$NLS_VALID_VALUES

WHERE PARAMETER=’LANGUAGE’;

Test Identify the tests that are performed on data during scan

- Find out whether the post conversion data fits into the current column size

- Find out whether the character codes of the data change

- Determine whether the data cells can be successfully converted

Select the features of various semantics

- Number of byte semantics is equal to the naber of characters

- Semantics can be define at both database and column levels

- Byte semantics define the bytes that can be stored in a column

- NLS_LENGTH_SEMANTICS parameter can be set dynamically

Globalization support

- users can interact, store and retrieve data in their native language, including Western European, Eastern European, Middle Eastern, East Asian and Southeast Asian languages.

- Different countries and geographies dictate different cultural conventions that directly affect data formats.

- Many different character encoding schemes including single bye, multi byte and fixed width encoded character sets are supported.

- The Oracle server provides many different linguistic sorts for linguistically accurate sorting

- Database utilities and error messages appear in the supported native language. The Oracle products are translated into 30 different languages.

- Date and time formats can be expressed according to ISO conventional for fractual seconds, seconds, minute, hour, day, month and year. Time zone regions can be used to support daylight savings time.

- National calendars such as Gregorian, Japanese, Imperial and Thai Buddha are supported.

- Numeric data is represented in appropriate local formats. Currency symbols reflect the local economy and the ISO conventions. Credit and debit symbols also different from location to location.

Different types of encoding schemes

Oracle supports different classes of character encoding schemes:

Single-byte character sets:

- 7-bit

- 8-bit

Varying-width multibyte character sets

Fixed-width multibyte character sets

Unicode (AL32UFT8, AL16UTF16, UTF8)

Character encoding schemes

A character encoding scheme specifies numeric codes corresponding to characters that a computer or terminal can display and receive.

Character encoding schemes are used to interpret data into meaningful symbols from a terminal to a host machine. Oracle provides different encoding schemes – Single-byte, varying-width, fixed-width, Unicode.

Single-byte character set

In a single byte character set each character occupies one byte. Single-byte 7 bit encoding schemes can define up to 128 characters (2X7) characters, single byte 8 bit encoding scheme can define up to 256 (2X8) characters.

Examples of Single Byte Schemes

7-bit character set: ASCII 7-bit American (ASCII)

8-bit character set:

- ISO 8859-1 West European (WE8ISO8859P1)

- EBCDIC Code page 500 8-bit West European (WE8EBCDIC500)

- DEC 8-bit West European (WE8DEC)

Varying width multibyte character sets

A varying width multibyte character set is represented by one or more bytes per character. Multibyte character sets are commonly used for Asian language support. Some mutibyte encoding schemes use the value of the most significant bit to indicate if a byte represents a single byte or a part of a series of bytes representing a character. However, other character encoding schemes differentiate single-byte from multi-byte characters. A shift out control code sent by a device indicates that the following bytes are double byte characters, until a shift-in code is encountered.

Examples of varying width multibyte schemes:

- Japanese extended UNIX code (JEUC)

- Chinese GB2312-80 (CGB2312-80)

- AL32UTF (UTF8)

Fixed width multibyte character sets

Fixed width multibyte character sets provide support similar to multibyte character sets, except that the format is a fixed number of bytes for each character. This provides the benefit of having a uniform size byte representation for each character. Only one fixed width multibyte character set is supported and is only in the National character set AL16UTF16.

Example of fixed width multibyte character sets:

- AL16UTF16, 16-bit Unicode (fixed width 2-byte Unicode)

Unicode character set

Unicode is a worldwide character encoding standard that can represent all characters for computer usage, including technical symbols and characters in publishing. Unicode 3.0 contains 49,149 characters with a capacity of over 1 million characters.

The Unicode character can be represented ina number of different encoding formats. UTF-16 (Universal Character Set Transformation Format) is a two-byte, fixed-width format; UTF-8 is a multibyte, varying width format.

Oracle provides AL32UTF8, UTF8, UTFE as database character sets and AL16UTF16 and UTF8 as national character sets. The advantage of UTF-8 based character sets is that they include ASCII using the same single-byte encoding. UTF-8 is a superset of ASCII, hence this makes database character set migration easier when upgrading ASCII based character sets to Unicode.

Database Character Sets and national Character Sets

Database Character Sets	National Character Sets
Defined at database creation time	Defined at database creation time
May not be changed without re-creation, few exceptions	May not be changed without re-creation, few exceptions
Store data in CHAR, VARCHAR2, CLOB, LONG	Store data columns of type NCHAR, NVARCHAR2, NCLOB
Can store varying width character sets	Can store Unicode using either AL16UTF16 or UTF8

Character set types

The CREATE DATABASE statement has the CHARACTER SET clause and the additional optional clause NATIONAL CHARACTER SET to declare the character set to be used as the database character set and the national character set. If no national character set is present the default is AL16UTF16.

Because the database character set is used to identify and hold SQL and PL/SQL source code, it must have either EBCDIC or 7-bit ASCII as a subset, whichever is native to the platform. Therefore, it is not possible to use a fixed width multibyte character set, and only a the national character set.

The national character set is for Unicode storage only and the SQL NCHAR datatypes (NCHAR, NVARCHAR2, NCLOB) are Unicode datatypes in Oracle9i.

Guidelines for choosing an Oracle database Character set

What language does the database need to support?

Several character sets may meet your current requirements, but you should consider future enhancements as well. If you know that you will need to expand support in the future for different languages, picking a character set with a wider range now will eliminate the need for migration later.

What are interoperability concerns with system resources and applications?

While the database maintains and processes the actual character data, there are other resources that you must depend on from the OS. For instance, the OS supplies fonts that correspond to the character set that you have chosen. Input methods that support the language(s) desired and application software must also be compatible with a particular set.

If you choose a character set that is different from what is available on the OS, Oracle will convert the OS character set to the database character set. However, there is some character set conversion overhead, and you need to make sure that the OS set has an equivalent character repertoire to avoid any possible data loss.

What are the performance implications?

There can be different performance overheads in handling different encoding schemes, depending on the character set that is chosen. For the best performance you should always try to choose a character set that avoids character conversion and uses the most efficient encoding for the languages desired. Single-byte character sets are the most efficient in terms of performance and space management. However, single-byte character sets limit the number of supported languages.

What are the restrictions?

You can not choose a database character set that is fixed-width multibyte.

Guidelines for choosing an Oracle national character set

Two choices – AL16UTF16 and UTF8

Is space an issue? Then use UTF8 – it is variable width, but slower

Performance an issue – use – AL16UTF16, as it is fixed width.

Is performance an issue?

AL16UTF16 is a fixed width 2-byte Unicode character set. UTF8 is a variable width 1 to 3 byte Unicode character set. European characters, in UTF8, are stored in 1 to 2 bytesand thus save space over AL16UTF16 which stores characters in 2 bytes. Asian characters, in UTF8, are stored in 3 bytes and require more space than AL16UTF16. Because AL16UTF16 is a fixed width encoding, it performs faster than the variable width UTF8.

Choosing a Unicode solution – Unicode database

When should you use a Unicode database?

- Easy code migration for Java or PL/SQL

- Easy data migration from ASCII based data

- Evenly distributed multilingual data

- Intermedia text search

Choosing a Unicode solution – Unicode datatype

When should you use a Unicode datatype?

- Adding multilingual support incrementally

- Packaged applications

- Performance (single byte database character set with a fixed width national set SQL NCHAR using AL16UTF16, as UTF8 has some performance overhead)

- Better support for UTF16 with windows clients

Dependent language and territory default values

PARAMETER

VALUES

NLS_LANGUAGE

- NLS_DATE_LANGUAGE

- NLS_SORT

AMERICAN

- AMERICAN

- BINARY

NLS_TERRITORY

- NLS_CURRENCY

- NLS_ISO_CURRENCY

- NLS_DATE_FORMAT

- NLS_NUMERIC_CHARACTERS

AMERICA

- $

- AMERICA

- DD-MON-RR

- , .