Oracle® Database Globalization Support Guide
10g Release 1 (10.1)

Part Number B10749-01

2
Choosing a Character Set

This chapter explains how to choose a character set. It includes the following topics:

Character Set Encoding

When computer systems process characters, they use numeric codes instead of the graphical representation of the character. For example, when the database stores the letter A, it actually stores a numeric code that is interpreted by software as the letter. These numeric codes are especially important in a global environment because of the potential need to convert data between different character sets.

This section includes the following topics:

What is an Encoded Character Set?

You specify an encoded character set when you create a database. Choosing a character set determines what languages can be represented in the database. It also affects:

  • How you create the database schema
  • How you develop applications that process character data
  • How the database works with the operating system
  • Performance
  • Storage required when storing character data

A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as a character set. An encoded character set assigns a unique numeric code to each character in the character repertoire. The numeric codes are called code points or encoded values. Table 2-1 shows examples of characters that have been assigned a hexadecimal code value in the ASCII character set.

Table 2-1 Encoded Characters in the ASCII Character Set

Character  Description       Hexadecimal Code Value
!          Exclamation Mark  21
#          Number Sign       23
$          Dollar Sign       24
1          Number 1          31
2          Number 2          32
3          Number 3          33
A          Uppercase A       41
B          Uppercase B       42
C          Uppercase C       43
a          Lowercase a       61
b          Lowercase b       62
c          Lowercase c       63
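The code values in Table 2-1 can be checked with any tool that exposes code points. The following is a minimal sketch in Python, which is not part of the Oracle documentation and is used here only for illustration. The helper name ascii_hex is hypothetical.

```python
# Print the hexadecimal ASCII code value of a few characters from Table 2-1.

def ascii_hex(ch: str) -> str:
    """Return the two-digit hexadecimal code value of a 7-bit ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
    return format(code, "02X")


for ch in ["!", "#", "$", "1", "A", "a"]:
    print(ch, ascii_hex(ch))
```

The values printed for these characters should match the third column of the table.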

The computer industry uses many encoded character sets. Character sets differ in the following ways:

Oracle supports most national, international, and vendor-specific encoded character set standards.

See Also:

"Character Sets" for a complete list of character sets that are supported by Oracle

Which Characters Are Encoded?

The characters that are encoded in a character set depend on the writing systems that are represented. A writing system can be used to represent a language or group of languages. Writing systems can be classified into two categories:

This section also includes the following topics:

Phonetic Writing Systems

Phonetic writing systems consist of symbols that represent different sounds associated with a language. Greek, Latin, Cyrillic, and Devanagari are all examples of phonetic writing systems based on alphabets. Note that alphabets can represent more than one language. For example, the Latin alphabet can represent many Western European languages such as French, German, and English.

Characters associated with a phonetic writing system can typically be encoded in one byte because the character repertoire is usually smaller than 256 characters.

Ideographic Writing Systems

Ideographic writing systems consist of ideographs or pictographs that represent the meaning of a word, not the sounds of a language. Chinese and Japanese are examples of ideographic writing systems that are based on tens of thousands of ideographs. Languages that use ideographic writing systems may also use a syllabary. Syllabaries provide a mechanism for communicating additional phonetic information. For instance, Japanese has two syllabaries: Hiragana, normally used for grammatical elements, and Katakana, normally used for foreign and onomatopoeic words.

Characters associated with an ideographic writing system typically are encoded in more than one byte because the character repertoire has tens of thousands of characters.

Punctuation, Control Characters, Numbers, and Symbols

In addition to encoding the script of a language, other special characters need to be encoded:

Writing Direction

Most Western languages are written left to right from the top to the bottom of the page. East Asian languages are usually written top to bottom from the right to the left of the page, although exceptions are frequently made for technical books translated from Western languages. Arabic and Hebrew are written right to left from the top to the bottom.

Numbers reverse direction in Arabic and Hebrew. Although the text is written right to left, numbers within the sentence are written left to right. For example, "I wrote 32 books" would be written as "skoob 32 etorw I". Regardless of the writing direction, Oracle stores the data in logical order. Logical order means the order that is used by someone typing a language, not how it looks on the screen.

Writing direction does not affect the encoding of a character.

What Characters Does a Character Set Support?

Different character sets support different character repertoires. Because character sets are typically based on a particular writing script, they can support more than one language. When character sets were first developed, they had a limited character repertoire. Even now there can be problems using certain characters across platforms. The following CHAR and VARCHAR characters are represented in all Oracle database character sets and can be transported to any platform:

If you are using characters outside this set, then take care that your data is supported in the database character set that you have chosen.

Setting the NLS_LANG parameter properly is essential to proper data conversion. The character set that is specified by the NLS_LANG parameter should reflect the setting for the client operating system. Setting NLS_LANG correctly enables proper conversion from the client operating system character encoding to the database character set. When the character set in NLS_LANG is the same as the database character set, Oracle assumes that the data being sent or received is already encoded in the database character set, so no validation or conversion is performed. This can lead to corrupt data if the client data is actually in a different encoding and conversion is necessary.

During conversion from one character set to another, Oracle expects client-side data to be encoded in the character set specified by the NLS_LANG parameter. If you put other values into the string (for example, by using the CHR or CONVERT SQL functions), then the values may be corrupted when they are sent to the database because they are not converted properly. If you have configured the environment correctly and if the database character set supports the entire repertoire of character data that may be input into the database, then you do not need to change the current database character set. However, if your enterprise becomes more global and you have additional characters or new languages to support, then you may need to choose a character set with a greater character repertoire. Oracle Corporation recommends that you use Unicode databases and datatypes in these cases.
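The effect of a mislabeled client character set can be imitated outside the database. The following Python sketch is an illustration only and does not use Oracle or NLS_LANG. It assumes a hypothetical client that really produces UTF-8 bytes while the receiving side treats them as ISO 8859-1 without conversion.

```python
# Hypothetical scenario: the client produces UTF-8 bytes, but the bytes are
# taken at face value as a single-byte character set and never converted.
original = "Straße"
client_bytes = original.encode("utf-8")      # what the client actually sends

# Correct label: the bytes are interpreted as UTF-8 and the text survives.
correct = client_bytes.decode("utf-8")

# Wrong label: the same bytes are interpreted as ISO 8859-1 (Latin-1).
mislabeled = client_bytes.decode("latin-1")

print(correct == original)      # True
print(mislabeled == original)   # False
print(len(original), len(mislabeled))
```

With the wrong label, the two bytes that encode ß are read as two unrelated characters, so the stored string is longer than the original and no longer matches it.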

ASCII Encoding

Table 2-2 shows how the ASCII character set is encoded. Row and column headings denote hexadecimal digits. To find the encoded value of a character, read the column number followed by the row number. For example, the code value of the character A is 0x41.

Table 2-2 7-Bit ASCII Character Set

-  0    1    2   3  4  5  6  7
0  NUL  DLE  SP  0  @  P  `  p
1  SOH  DC1  !   1  A  Q  a  q
2  STX  DC2  "   2  B  R  b  r
3  ETX  DC3  #   3  C  S  c  s
4  EOT  DC4  $   4  D  T  d  t
5  ENQ  NAK  %   5  E  U  e  u
6  ACK  SYN  &   6  F  V  f  v
7  BEL  ETB  '   7  G  W  g  w
8  BS   CAN  (   8  H  X  h  x
9  TAB  EM   )   9  I  Y  i  y
A  LF   SUB  *   :  J  Z  j  z
B  VT   ESC  +   ;  K  [  k  {
C  FF   FS   ,   <  L  \  l  |
D  CR   GS   -   =  M  ]  m  }
E  SO   RS   .   >  N  ^  n  ~
F  SI   US   /   ?  O  _  o  DEL
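The lookup rule for Table 2-2 (column digit first, then row digit) corresponds to splitting a code value into its high and low hexadecimal digits. A minimal Python sketch, with the hypothetical helper name table_position:

```python
def table_position(ch: str) -> tuple:
    """Return the (column, row) hexadecimal digits of a 7-bit ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    column, row = divmod(code, 16)   # high digit = column, low digit = row
    return format(column, "X"), format(row, "X")


print(table_position("A"))   # ('4', '1'), that is, 0x41
```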

Character sets have evolved to meet the needs of users around the world. New character sets have been created to support languages besides English. Typically, these new character sets support a group of related languages based on the same script. For example, the ISO 8859 character set series was created to support different European languages. Table 2-3 shows the languages that are supported by the ISO 8859 character sets.

Table 2-3 ISO 8859 Character Sets

Standard     Languages Supported
ISO 8859-1   Western European (Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish)
ISO 8859-2   Eastern European (Albanian, Croatian, Czech, English, German, Hungarian, Latin, Polish, Romanian, Slovak, Slovenian, Serbian)
ISO 8859-3   Southeastern European (Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, Turkish)
ISO 8859-4   Northern European (Danish, English, Estonian, Finnish, German, Greenlandic, Latin, Latvian, Lithuanian, Norwegian, Sámi, Slovenian, Swedish)
ISO 8859-5   Eastern European (Cyrillic-based: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian)
ISO 8859-6   Arabic
ISO 8859-7   Greek
ISO 8859-8   Hebrew
ISO 8859-9   Western European (Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Finnish, French, Frisian, Galician, German, Greenlandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish, Turkish)
ISO 8859-10  Northern European (Danish, English, Estonian, Faeroese, Finnish, German, Greenlandic, Icelandic, Irish Gaelic, Latin, Lithuanian, Norwegian, Sámi, Slovenian, Swedish)
ISO 8859-13  Baltic Rim (English, Estonian, Finnish, Latin, Latvian, Norwegian)
ISO 8859-14  Celtic (Albanian, Basque, Breton, Catalan, Cornish, Danish, English, Galician, German, Greenlandic, Irish Gaelic, Italian, Latin, Luxemburgish, Manx Gaelic, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish, Welsh)
ISO 8859-15  Western European (Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish)

Character sets evolved and provided restricted multilingual support. They were restricted in the sense that they were limited to groups of languages based on similar scripts. More recently, universal character sets have been regarded as a more useful solution to multilingual support. Unicode is one such universal character set that encompasses most major scripts of the modern world. The Unicode character set supports more than 94,000 characters.

See Also:

Chapter 6, "Supporting Multilingual Databases with Unicode"

How are Characters Encoded?

Different types of encoding schemes have been created by the computer industry. The character set you choose affects what kind of encoding scheme is used. This is important because different encoding schemes have different performance characteristics. These characteristics can influence your database schema and application development. The character set you choose uses one of the following types of encoding schemes:

Single-Byte Encoding Schemes

Single-byte encoding schemes are efficient. They take up the least amount of space to represent characters and are easy to process and program with because one character can be represented in one byte. Single-byte encoding schemes are classified as one of the following:

Figure 2-1 ISO 8859-1 8-Bit Encoding Scheme

Multibyte Encoding Schemes

Multibyte encoding schemes are needed to support ideographic scripts used in Asian languages like Chinese or Japanese because these languages use thousands of characters. These encoding schemes use either a fixed number or a variable number of bytes to represent each character.

  • Fixed-width multibyte encoding schemes

    In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of bytes. The number of bytes is at least two in a multibyte encoding scheme.

  • Variable-width multibyte encoding schemes

    A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that represents a character. For example, if two bytes is the maximum number of bytes used to represent a character, then the most significant bit can be used to indicate whether that byte is a single-byte character or the first byte of a double-byte character.

  • Shift-sensitive variable-width multibyte encoding schemes

    Some variable-width encoding schemes use control codes to differentiate between single-byte and multibyte characters with the same code values. A shift-out code indicates that the following character is multibyte. A shift-in code indicates that the following character is single-byte. Shift-sensitive encoding schemes are used primarily on IBM platforms. Note that ISO-2022 character sets cannot be used as database character sets, but they can be used for applications such as a mail server.

Naming Convention for Oracle Character Sets

Oracle uses the following naming convention for Oracle character set names:

<region><number of bits used to represent a character><standard character set name>[S|C]

The parts of the names in angle brackets are concatenated. The optional S or C is used to differentiate character sets that can be used only on the server (S) or only on the client (C).

Note:

Use the server character set (S) on the Macintosh platform. The Macintosh client character sets are obsolete. On EBCDIC platforms, use the server character set (S) on the server and the client character set (C) on the client.

Table 2-4 shows examples of Oracle character set names.

Table 2-4 Examples of Oracle Character Set Names

Oracle Character Set Name  Description                                           Region               Number of Bits Used to Represent a Character  Standard Character Set Name
US7ASCII                   U.S. 7-bit ASCII                                      US                   7                                             ASCII
WE8ISO8859P1               Western European 8-bit ISO 8859 Part 1                WE (Western Europe)  8                                             ISO8859 Part 1
JA16SJIS                   Japanese 16-bit Shifted Japanese Industrial Standard  JA                   16                                            SJIS

Note:

UTF8 and UTFE are exceptions to the naming convention.
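The naming convention can be illustrated by splitting the example names from Table 2-4 into their parts. The following Python sketch is not an Oracle utility. The pattern and the function name parse_charset_name are assumptions made for illustration, and the optional S or C suffix is deliberately left attached to the standard name because a trailing letter can also belong to that name (for example, the final S in SJIS).

```python
import re

# Region letters, then the number of bits, then the standard character set name.
NAME_PATTERN = re.compile(r"^([A-Z]+)(\d+)(.+)$")


def parse_charset_name(name: str):
    """Return (region, bits, standard_name), or None if the name does not fit."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    region, bits, standard = match.groups()
    return region, int(bits), standard


for name in ["US7ASCII", "WE8ISO8859P1", "JA16SJIS", "UTF8", "UTFE"]:
    print(name, parse_charset_name(name))
```

UTF8 and UTFE do not fit the pattern, which is consistent with the note that they are exceptions to the convention.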

Length Semantics

In single-byte character sets, the number of bytes and the number of characters in a string are the same. In multibyte character sets, a character or code point consists of one or more bytes. Calculating the number of characters based on byte lengths can be difficult in a variable-width character set. Calculating column lengths in bytes is called byte semantics, while measuring column lengths in characters is called character semantics.

Character semantics was introduced in Oracle9i. Character semantics is useful for defining the storage requirements for multibyte strings of varying widths. For example, in a Unicode database (AL32UTF8), suppose that you need to define a VARCHAR2 column that can store up to five Chinese characters together with five English characters. Using byte semantics, this column requires 15 bytes for the Chinese characters, which are three bytes long, and 5 bytes for the English characters, which are one byte long, for a total of 20 bytes. Using character semantics, the column requires 10 characters.
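The arithmetic in this example can be reproduced outside the database. The following Python sketch uses the standard utf-8 codec as a stand-in for AL32UTF8; the sample strings are arbitrary choices made for illustration.

```python
chinese = "中文字符集"   # five Chinese characters, three bytes each in UTF-8
english = "ABCDE"        # five English characters, one byte each in UTF-8
value = chinese + english

char_length = len(value)                   # length under character semantics
byte_length = len(value.encode("utf-8"))   # length under byte semantics

print(char_length, byte_length)   # 10 20
```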

The following expressions use byte semantics:

  • VARCHAR2(20 BYTE)
  • SUBSTRB(string, 1, 20)

Note the BYTE qualifier in the VARCHAR2 expression and the B suffix in the SQL function name.

    The following expressions use character semantics:

    • VARCHAR2(10 CHAR)
    • SUBSTR(string, 1, 10)

    Note the CHAR qualifier in the VARCHAR2 expression.

    The NLS_LENGTH_SEMANTICS initialization parameter determines whether a new column of character datatype uses byte or character semantics. The default value of the parameter is BYTE. The BYTE and CHAR qualifiers shown in the VARCHAR2 definitions should be avoided when possible because they lead to mixed-semantics databases. Instead, set NLS_LENGTH_SEMANTICS in the initialization parameter file and define column datatypes to use the default semantics based on the value of NLS_LENGTH_SEMANTICS.

    Byte semantics is the default for the database character set. Character length semantics is the default and the only allowable kind of length semantics for NCHAR datatypes. The user cannot specify the CHAR or BYTE qualifier for NCHAR definitions.

    Consider the following example:

    CREATE TABLE employees
    ( employee_id NUMBER(4)
    , last_name NVARCHAR2(10)
    , job_id NVARCHAR2(9)
    , manager_id NUMBER(4)
    , hire_date DATE
    , salary NUMBER(7,2)
    , department_id NUMBER(2)
    ) ;

    When the NCHAR character set is AL16UTF16, last_name can hold up to 10 Unicode code points, which occupy up to 20 bytes.

    Figure 2-2 shows the number of bytes needed to store different kinds of characters in the UTF-8 character set. The ASCII characters require one byte, the Latin and Greek characters require two bytes, the Asian character requires three bytes, and the supplementary character requires four bytes of storage.

    Figure 2-2 Bytes of Storage for Different Kinds of Characters

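The byte counts described for Figure 2-2 can be confirmed with any UTF-8 encoder. In the Python sketch below, the sample characters are arbitrary picks for each category and are not taken from the figure itself.

```python
# One sample character per category; these particular characters are assumptions.
samples = {
    "ASCII": "a",
    "Latin": "ä",
    "Greek": "δ",
    "Asian": "亜",
    "supplementary": "\U00020000",   # a character outside the Basic Multilingual Plane
}


def utf8_width(ch: str) -> int:
    """Return the number of bytes needed to store one character in UTF-8."""
    return len(ch.encode("utf-8"))


for kind, ch in samples.items():
    print(kind, utf8_width(ch))
```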

    Choosing an Oracle Database Character Set

    Oracle uses the database character set for:

    • Data stored in SQL CHAR datatypes (CHAR, VARCHAR2, CLOB, and LONG)
    • Identifiers such as table names, column names, and PL/SQL variables
    • Entering and storing SQL and PL/SQL source code

    The character encoding scheme used by the database is defined as part of the CREATE DATABASE statement. All SQL CHAR datatype columns (CHAR, CLOB, VARCHAR2, and LONG), including columns in the data dictionary, have their data stored in the database character set. In addition, the choice of database character set determines which characters can name objects in the database. SQL NCHAR datatype columns (NCHAR, NCLOB, and NVARCHAR2) use the national character set.

    Note:

    CLOB data is stored in a format that is compatible with UCS-2 if the database character set is multibyte. If the database character set is single-byte, then CLOB data is stored in the database character set.


    After the database is created, you cannot change the character sets, with some exceptions, without re-creating the database.

    Consider the following questions when you choose an Oracle character set for the database:

    • What languages does the database need to support now?
    • What languages does the database need to support in the future?
    • Is the character set available on the operating system?
    • What character sets are used on clients?
    • How well does the application handle the character set?
    • What are the performance implications of the character set?
    • What are the restrictions associated with the character set?

    The Oracle character sets are listed in "Character Sets". They are named according to the languages and regions in which they are used. Some character sets that are named for a region are also listed explicitly by language.

    If you want to see the characters that are included in a character set, then:

    • Check national, international, or vendor product documentation or standards documents
    • Use Oracle Locale Builder

    This section contains the following topics:

    Current and Future Language Requirements

    Several character sets may meet your current language requirements. Consider future language requirements when you choose a database character set. If you expect to support additional languages in the future, then choose a character set that supports those languages to prevent the need to migrate to a different character set later.

    Client Operating System and Application Compatibility

    The database character set is independent of the operating system because Oracle has its own globalization architecture. For example, on an English Windows operating system, you can create and run a database with a Japanese character set. However, when an application in the client operating system accesses the database, the client operating system must be able to support the database character set with appropriate fonts and input methods. For example, you cannot insert or retrieve Japanese data on the English Windows operating system without first installing a Japanese font and input method. Another way to insert and retrieve Japanese data is to use a Japanese operating system remotely to access the database server.

    Character Set Conversion Between Clients and the Server

    If you choose a database character set that is different from the character set on the client operating system, then the Oracle database can convert the operating system character set to the database character set. Character set conversion has the following disadvantages:

    • Potential data loss
    • Increased overhead

    Character set conversions can sometimes cause data loss. For example, if you are converting from character set A to character set B, then the destination character set B must contain every character in the repertoire of A. Any characters that are not available in character set B are converted to a replacement character. The replacement character is often specified as a question mark or as a linguistically related character. For example, ä (a with an umlaut) may be converted to a. If you have distributed environments, then consider using character sets with similar character repertoires to avoid loss of data.

    Character set conversion may require copying strings between buffers several times before the data reaches the client. The database character set should always be a superset or equivalent of the native character set of the client's operating system. The character sets used by client applications that access the database usually determine which superset is the best choice.

    If all client applications use the same character set, then that character set is usually the best choice for the database character set. When client applications use different character sets, the database character set should be a superset of all the client character sets. This ensures that every character is represented when converting from a client character set to the database character set.
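One way to reason about the superset requirement is to test whether sample data from every client can be represented in a candidate character set. The following Python sketch is an illustration only: it uses Python codec names rather than Oracle character set names, and the function name and sample strings are hypothetical.

```python
def unsupported_characters(text: str, charset: str) -> set:
    """Return the characters of text that cannot be encoded in charset."""
    missing = set()
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.add(ch)
    return missing


# Sample data standing in for German, French, and Japanese clients.
client_samples = ["Grüße", "français", "日本語"]

for candidate in ["latin-1", "utf-8"]:
    missing = set()
    for sample in client_samples:
        missing |= unsupported_characters(sample, candidate)
    print(candidate, sorted(missing))
```

A candidate that reports no missing characters for representative data from all clients behaves as a superset for that data. A real evaluation would compare the full repertoires rather than a few sample strings.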

    Performance Implications of Choosing a Database Character Set

    For best performance, choose a character set that avoids character set conversion and uses the most efficient encoding for the languages desired. Single-byte character sets result in better performance than multibyte character sets, and they also are the most efficient in terms of space requirements. However, single-byte character sets limit how many languages you can support.

    Restrictions on Database Character Sets

    ASCII-based character sets are supported only on ASCII-based platforms. Similarly, you can use an EBCDIC-based character set only on EBCDIC-based platforms.

    The database character set is used to identify SQL and PL/SQL source code. In order to do this, it must have either EBCDIC or 7-bit ASCII as a subset, whichever is native to the platform. Therefore, it is not possible to use a fixed-width, multibyte character set as the database character set. Currently, only the AL16UTF16 character set cannot be used as a database character set.

    Restrictions on Character Sets Used to Express Names

    Table 2-5 lists the restrictions on the character sets that can be used to express names.

    Table 2-5 Restrictions on Character Sets Used to Express Names

    Name                                                                          Single-Byte  Variable Width  Comments
    column names                                                                  Yes          Yes             -
    schema objects                                                                Yes          Yes             -
    comments                                                                      Yes          Yes             -
    database link names                                                           Yes          No              -
    database names                                                                Yes          No              -
    file names (datafile, log file, control file, initialization parameter file)  Yes          No              -
    instance names                                                                Yes          No              -
    directory names                                                               Yes          No              -
    keywords                                                                      Yes          No              Can be expressed in English ASCII or EBCDIC characters only
    Recovery Manager file names                                                   Yes          No              -
    rollback segment names                                                        Yes          No              The ROLLBACK_SEGMENTS parameter does not support NLS
    stored script names                                                           Yes          Yes             -
    tablespace names                                                              Yes          No              -

    For a list of supported string formats and character sets, including LOB data (LOB, BLOB, CLOB, and NCLOB), see Table 2-7.

    Choosing a National Character Set

    A national character set is an alternate character set that enables you to store Unicode character data in a database that does not have a Unicode database character set. Other reasons for choosing a national character set are:

    • The properties of a different character encoding scheme may be more desirable for extensive character processing operations
    • Programming in the national character set is easier

    SQL NCHAR, NVARCHAR2, and NCLOB datatypes have been redefined to support Unicode data only. You can use either the UTF8 or the AL16UTF16 character set. The default is AL16UTF16.

    See Also:

    Chapter 6, "Supporting Multilingual Databases with Unicode"

    Summary of Supported Datatypes

    Table 2-6 lists the datatypes that are supported for different encoding schemes.

    Table 2-6 SQL Datatypes Supported for Encoding Schemes

    Datatype   Single Byte  Multibyte Non-Unicode  Multibyte Unicode
    CHAR       Yes          Yes                    Yes
    VARCHAR2   Yes          Yes                    Yes
    NCHAR      No           No                     Yes
    NVARCHAR2  No           No                     Yes
    BLOB       Yes          Yes                    Yes
    CLOB       Yes          Yes                    Yes
    LONG       Yes          Yes                    Yes
    NCLOB      No           No                     Yes


    Note:

    BLOBs process characters as a series of byte sequences. The data is not subject to any NLS-sensitive operations.


    Table 2-7 lists the SQL datatypes that are supported for abstract datatypes.

    Table 2-7 Abstract Datatype Support for SQL Datatypes

    Abstract Datatype  CHAR  NCHAR  BLOB  CLOB  NCLOB
    Object             Yes   Yes    Yes   Yes   Yes
    Collection         Yes   Yes    Yes   Yes   Yes

    You can create an abstract datatype with the NCHAR attribute as follows:

    SQL> CREATE TYPE tp1 AS OBJECT (a NCHAR(10));
    
    Type created.
    SQL> CREATE TABLE t1 (a tp1);
    Table created.
    

    Changing the Character Set After Database Creation

    You may wish to change the database character set after the database has been created. For example, you may find that the number of languages that need to be supported in your database has increased. In most cases, you need to do a full export/import to properly convert all data to the new character set. However, if, and only if, the new character set is a strict superset of the current character set, then it is possible to use the ALTER DATABASE CHARACTER SET statement to expedite the change in the database character set.

    See Also:

    Oracle Database Application Developer's Guide - Object-Relational Features for more information about objects and collections

    Monolingual Database Scenario

    The simplest example of a database configuration is a client and a server that run in the same language environment and use the same character set. This monolingual scenario has the advantage of fast response because the overhead associated with character set conversion is avoided. Figure 2-3 shows a database server and a client that use the same character set. The Japanese client and the server both use the JA16EUC character set.

    Figure 2-3 Monolingual Database Scenario

    You can also use a multitier architecture. Figure 2-4 shows an application server between the database server and the client. The application server and the database server use the same character set in a monolingual scenario. The server, the application server, and the client use the JA16EUC character set.

    Figure 2-4 Multitier Monolingual Database Scenario

    Character Set Conversion in a Monolingual Scenario

    Character set conversion may be required in a client/server environment if a client application resides on a different platform than the server and if the platforms do not use the same character encoding schemes. Character data passed between client and server must be converted between the two encoding schemes. Character conversion occurs automatically and transparently through Oracle Net.

    You can convert between any two character sets. Figure 2-5 shows a server and one client with the JA16EUC Japanese character set. The other client uses the JA16SJIS Japanese character set.

    Figure 2-5 Character Set Conversion

    When a target character set does not contain all of the characters in the source data, replacement characters are used. If, for example, a server uses US7ASCII and a German client uses WE8ISO8859P1, then the German character ß is replaced with ? and ä is replaced with a.

    Replacement characters may be defined for specific characters as part of a character set definition. When a specific replacement character is not defined, a default replacement character is used. To avoid the use of replacement characters when converting from a client character set to a database character set, the server character set should be a superset of all the client character sets.

    Figure 2-6 shows that data loss occurs when the database character set does not include all of the characters in the client character set. The database character set is US7ASCII. The client's character set is WE8MSWIN1252, and the language used by the client is German. When the client inserts a string that contains ß, the database replaces ß with ?, resulting in lost data.
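The loss described for Figure 2-6 can be imitated with a general-purpose encoder. The Python sketch below is only an analogy: it uses Python's ascii codec with the replace error handler, which substitutes ? for every unsupported character and does not model Oracle's character-specific replacements (such as ä becoming a).

```python
text = "Straße"

# Convert to a 7-bit ASCII target; ß has no ASCII code, so it is replaced.
stored = text.encode("ascii", errors="replace")
restored = stored.decode("ascii")

print(stored)     # b'Stra?e'
print(restored)   # Stra?e
```

Once the replacement has been stored, the original character cannot be recovered by converting back.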

    Figure 2-6 Data Loss During Character Conversion

    If German data is expected to be stored on the server, then a database character set that supports German characters should be used for both the server and the client to avoid data loss and conversion overhead.

    When one of the character sets is a variable-width multibyte character set, conversion can introduce noticeable overhead. Carefully evaluate your situation and choose character sets to avoid conversion as much as possible.

    Multilingual Database Scenarios

    Multilingual support can be restricted or unrestricted. This section contains the following topics:

    Restricted Multilingual Support

    Some character sets support multiple languages because they have related writing systems or scripts. For example, the WE8ISO8859P1 Oracle character set supports the following Western European languages:

    Catalan
    Danish
    Dutch
    English
    Finnish
    French
    German
    Icelandic
    Italian
    Norwegian
    Portuguese
    Spanish
    Swedish

    These languages all use a Latin-based writing script.

    When you use a character set that supports a group of languages, your database has restricted multilingual support.

    Figure 2-7 shows a Western European server that uses the WE8ISO8859P1 Oracle character set, a French client that uses the same character set as the server, and a German client that uses the WE8DEC character set. The German client requires character conversion because it is using a different character set than the server.

    Figure 2-7 Restricted Multilingual Support

    Unrestricted Multilingual Support

    If you need unrestricted multilingual support, then use a universal character set such as Unicode for the server database character set. Unicode has two major encoding schemes:

    • UTF-16: Each character is either 2 or 4 bytes long.
    • UTF-8: Each character takes 1 to 4 bytes to store.

    The database provides support for UTF-8 as a database character set and both UTF-8 and UTF-16 as national character sets.
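The 2-or-4-byte behavior of UTF-16 comes from its 16-bit code units: most characters take one unit, and supplementary characters take a pair of units. A minimal Python sketch, with the hypothetical helper name utf16_code_units and arbitrary sample characters:

```python
def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")   # big-endian, without a byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for ch in ["A", "日", "\U00020000"]:
    units = utf16_code_units(ch)
    print(f"U+{ord(ch):04X}", [hex(u) for u in units], 2 * len(units), "bytes")
```

The first two samples need one code unit (2 bytes) each, and the supplementary character U+20000 needs two (4 bytes).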

    Character set conversion between a UTF-8 database and any single-byte character set introduces very little overhead.

    Conversion between UTF-8 and any multibyte character set has some overhead. There is no data loss from conversion with the following exceptions:

    • Some multibyte character sets do not support user-defined characters during character set conversion to and from UTF-8.
    • Some Unicode characters are mapped to more than one character in another character set. For example, one Unicode character is mapped to three characters in the JA16SJIS character set. This means that a round-trip conversion may not result in the original JA16SJIS character.

    Figure 2-8 shows a server that uses the AL32UTF8 Oracle character set that is based on the Unicode UTF-8 character set.

    Figure 2-8 Unrestricted Multilingual Support Scenario in a Client/Server Configuration

    There are four clients:

    • A French client that uses the WE8ISO8859P1 Oracle character set
    • A German client that uses the WE8DEC character set
    • A Japanese client that uses the JA16EUC character set
    • A Japanese client that uses the JA16SJIS character set

    Character conversion takes place between each client and the server, but there is no data loss because AL32UTF8 is a universal character set. If the German client tries to retrieve data from one of the Japanese clients, then all of the Japanese characters in the data are lost during the character set conversion.
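The data flow in this scenario can be sketched as a chain of conversions: sender character set to the server's universal character set, then server to receiver. The Python below is an analogy, not Oracle behavior. The mapping from Oracle character set names to Python codecs is an assumption, WE8DEC is omitted because Python has no built-in codec for it (WE8ISO8859P1 stands in for a Western European client), and via_server is a hypothetical helper.

```python
# Assumed Python codec stand-ins for the Oracle character sets named above.
CLIENT_CODECS = {
    "WE8ISO8859P1": "latin-1",
    "JA16EUC": "euc_jp",
    "JA16SJIS": "shift_jis",
}


def via_server(text: str, sender: str, receiver: str) -> str:
    """Pass text from one client to another through a UTF-8 server."""
    sent = text.encode(CLIENT_CODECS[sender])                     # bytes on the sender
    stored = sent.decode(CLIENT_CODECS[sender]).encode("utf-8")   # server copy, lossless
    delivered = stored.decode("utf-8").encode(
        CLIENT_CODECS[receiver], errors="replace"                 # loss can occur here
    )
    return delivered.decode(CLIENT_CODECS[receiver])


print(via_server("日本語", "JA16EUC", "JA16SJIS"))       # 日本語
print(via_server("日本語", "JA16EUC", "WE8ISO8859P1"))   # ???
```

The server copy keeps every character. The loss appears only in the final step, when the receiving client's character set cannot represent the data.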

    Figure 2-9 shows a Unicode solution for a multitier configuration.

    Figure 2-9 Multitier Unrestricted Multilingual Support Scenario in a Multitier Configuration

    The database, the application server, and each client use the AL32UTF8 character set. This eliminates the need for character conversion even though the clients are French, German, and Japanese.

    See Also:

    Chapter 6, "Supporting Multilingual Databases with Unicode"