accent
A mark that changes the sound of a character. Because the common meaning of accent is associated with the stress or prominence of the character's sound, the preferred word in Oracle Database Globalization Support Guide is diacritic.
AL16UTF16
The default Oracle character set for the SQL NCHAR data type, which is used for the national character set. It encodes Unicode data in the UTF-16 encoding.
ASCII
American Standard Code for Information Interchange. A common encoded 7-bit character set for English. ASCII includes the letters A-Z and a-z, as well as digits, punctuation symbols, and control characters. The Oracle character set name is US7ASCII.
base letter
A character without diacritics. For example, the base letter for a, A, ä, and Ä is a.
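The idea of a base letter can be approximated with Unicode decomposition. The sketch below is only an illustration: the helper name `base_letter` is hypothetical, and stripping combining marks after NFD normalization is an assumption that stands in for Oracle's own linguistic definitions.

```python
import unicodedata

def base_letter(ch):
    """Approximate the base letter: drop combining marks, then lowercase."""
    # NFD splits a precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

# a, A, a-umlaut, A-umlaut (written as escapes to avoid encoding issues)
letters = ["a", "A", "\u00e4", "\u00c4"]
bases = [base_letter(c) for c in letters]
```

All four inputs reduce to the same base letter in this approximation.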
canonical equivalence
A basic equivalence between characters or sequences of characters. For example, ç is equivalent to the combination of c and the cedilla (¸). They cannot be distinguished when they are correctly rendered.
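Canonical equivalence can be observed with Unicode normalization. The following minimal sketch uses Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e7"   # c with cedilla as a single code point
combined = "c\u0327"     # c followed by COMBINING CEDILLA

# The two strings differ as code point sequences...
differ_as_sequences = precomposed != combined

# ...but normalization maps each form onto the other.
composed_form = unicodedata.normalize("NFC", combined)
decomposed_form = unicodedata.normalize("NFD", precomposed)
```

Comparing strings after normalizing both to the same form is one common way to treat canonically equivalent text as equal.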
case
Refers to the condition of being uppercase or lowercase. For example, in a Latin alphabet, A is the uppercase glyph for a, the lowercase glyph.
case conversion
Changing a character from uppercase to lowercase or vice versa.
case-insensitive linguistic sort
A linguistic sort that uses information about base letters and diacritics but not case.
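As a rough illustration of sorting by base letters and diacritics while ignoring case, the sketch below builds a two-part sort key. The function name `case_insensitive_key` is hypothetical, and the approach (case folding plus NFD decomposition) is an assumption that does not reproduce Oracle's actual sort definitions.

```python
import unicodedata

def case_insensitive_key(text):
    """Sort key: base letters first, then diacritics; case is discarded."""
    folded = unicodedata.normalize("NFD", text.casefold())
    base = "".join(c for c in folded if not unicodedata.combining(c))
    # Strings with equal base letters are ordered by their diacritics.
    return (base, folded)

words = ["Zebra", "apple", "\u00c4pfel", "apfel"]
ordered = sorted(words, key=case_insensitive_key)
```

With this key, "APPLE" and "apple" compare as equal, while "apfel" and "Äpfel" differ only at the diacritic level.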
character
A character is an abstract element of text. A character is different from a glyph, which is a specific representation of a character. For example, the first character of the English uppercase alphabet can be displayed as A in several different typefaces. These forms are different glyphs that represent the same character. A character, a character code, and a glyph are related as follows:
character --(encoding)--> character code --(font)--> glyph
For example, the first character of the English uppercase alphabet is represented in computer memory as a number. The number is called the encoding or the character code. The character code for the first character of the English uppercase alphabet is 0x41 in the ASCII encoding scheme. The character code is 0xc1 in the EBCDIC encoding scheme.
You must choose a font to display or print the character. The available fonts depend on which encoding scheme is being used. Whichever font is chosen, the resulting forms are different glyphs that represent the same character.
character code
A character code is a number that represents a specific character. The number depends on the encoding scheme. For example, the character code of the first character of the English uppercase alphabet is 0x41 in the ASCII encoding scheme, but it is 0xc1 in the EBCDIC encoding scheme.
See also byte semantics and length semantics.
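The dependence of a character code on the encoding scheme can be checked directly. This sketch assumes Python's `cp037` codec as a representative member of the EBCDIC family; other EBCDIC code pages exist and may differ for other characters.

```python
# Encode the same character under two encoding schemes and read the
# resulting character code (the single byte produced in each case).
ascii_code = "A".encode("ascii")[0]
ebcdic_code = "A".encode("cp037")[0]   # cp037: one EBCDIC code page

summary = "ASCII: 0x%02x, EBCDIC (cp037): 0x%02x" % (ascii_code, ebcdic_code)
```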
character set
A collection of elements that represent textual information for a specific language or group of languages. One language can be represented by more than one character set.
A character set does not always imply a specific character encoding scheme. A character encoding scheme is the assignment of a character code to each character in a character set.
In this manual, a character set usually does imply a specific character encoding scheme. Therefore, a character set is the same as an encoded character set in this manual.
character set migration
Changing the character set of an existing database.
character string
An ordered group of characters.
A character string can also contain no characters. In this case, the character string is called a null string. The number of characters in a null string is 0 (zero).
character classification
Character classification information provides details about the type of character associated with each character code. For example, a character can be an uppercase, lowercase, punctuation, or control character.
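A minimal sketch of character classification follows. It assumes the Unicode general categories exposed by Python's `unicodedata` module as the source of classification data, which is not the same as Oracle's NLS data; the function name `classify` and its labels are illustrative.

```python
import unicodedata

def classify(ch):
    """Map a character to one of the classes named in the definition."""
    category = unicodedata.category(ch)   # for example "Lu", "Ll", "Po", "Cc"
    if category == "Lu":
        return "uppercase"
    if category == "Ll":
        return "lowercase"
    if category.startswith("P"):
        return "punctuation"
    if category == "Cc":
        return "control"
    return "other"

labels = [classify(ch) for ch in ["A", "a", ",", "\n", "7"]]
```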
character encoding scheme
A rule that assigns numbers (character codes) to all characters in a character set. Encoding scheme, encoding method, and encoding also mean character encoding scheme.
client character set
The encoded character set used by the client. A client character set can differ from the server character set. The server character set is called the database character set. If the client character set is different from the database character set, then character set conversion must occur.
code point
The numeric representation of a character in a character set. For example, the code point of A in the ASCII character set is 0x41. The code point of a character is also called the encoded value of a character.
code unit
The unit of encoded text for processing and interchange. The size of the code unit varies depending on the character encoding scheme. In most character encodings, a code unit is 1 byte. Important exceptions are UTF-16 and UCS-2, which use 2-byte code units, and wide character, which uses 4 bytes.
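The following sketch counts code units for the same text under three encodings. The helper `count_code_units` is hypothetical, and UTF-32 is used here as an assumed stand-in for a 4-byte wide character format.

```python
def count_code_units(text, encoding, unit_bytes):
    """Number of code units = encoded length divided by the unit size."""
    encoded = text.encode(encoding)
    return len(encoded) // unit_bytes

# One ASCII letter plus one supplementary character (U+1F600).
sample = "A\U0001F600"

utf8_units = count_code_units(sample, "utf-8", 1)       # 1-byte code units
utf16_units = count_code_units(sample, "utf-16-le", 2)  # 2-byte code units
utf32_units = count_code_units(sample, "utf-32-le", 4)  # 4-byte code units
```

The same two characters occupy a different number of code units in each encoding, which is why string lengths measured in code units are not interchangeable across encodings.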
collation
Ordering of character strings according to rules about sorting characters that are associated with a language in a specific locale. Also called linguistic sort.
data scanning
The process of identifying potential problems with character set conversion and truncation of data before migrating the database character set.
database character set
The encoded character set that is used to store text in the database. This includes CHAR, VARCHAR2, LONG, and fixed-width CLOB column values and all SQL and PL/SQL text.
diacritic
A mark near or through a character or combination of characters that indicates a different sound than the sound of the character without the diacritical mark. For example, the cedilla in façade is a diacritic. It changes the sound of c.
EBCDIC
Extended Binary Coded Decimal Interchange Code. EBCDIC is a family of encoded character sets used mostly on IBM systems.
encoded character set
A character set with an associated character encoding scheme. An encoded character set specifies the number (character code) that is assigned to each character.
encoded value
The numeric representation of a character in a character set. For example, the code point of A in the ASCII character set is 0x41. The encoded value of a character is also called the code point of a character.
font
An ordered collection of character glyphs that provides a graphical representation of characters in a character set.
globalization
The process of making software suitable for different linguistic and cultural environments. Globalization should not be confused with localization, which is the process of preparing software for use in one specific locale.
glyph
A glyph (font glyph) is a specific representation of a character. A character can have many different glyphs. For example, the first character of the English uppercase alphabet can be printed or displayed as A in several different typefaces. These forms are different glyphs that represent the same character.
ideograph
A symbol that represents an idea. Chinese is an example of an ideographic writing system.
ISO
International Organization for Standardization. A worldwide federation of national standards bodies from 130 countries. The mission of ISO is to develop and promote standards in the world to facilitate the international exchange of goods and services.
ISO 8859
A family of 8-bit encoded character sets. The most common one is ISO 8859-1 (also known as ISO Latin1), which is used for Western European languages.
ISO 14651
A multilingual linguistic sort standard that is designed for almost all languages of the world.
ISO/IEC 10646
A universal character set standard that defines the characters of most major scripts used in the modern world. In 1993, ISO adopted Unicode version 1.1 as ISO/IEC 10646-1:1993. ISO/IEC 10646 has two formats: UCS-2 is a 2-byte fixed-width format, and UCS-4 is a 4-byte fixed-width format. There are three levels of implementation, all relating to support for composite characters:
Level 1 requires no composite character support.
Level 2 requires support for specific scripts (including most of the Unicode scripts such as Arabic and Thai).
Level 3 requires unrestricted support for composite characters in all languages.
ISO currency
The 3-letter abbreviation used to denote a local currency, based on the ISO 4217 standard. For example, USD represents the United States dollar.
ISO Latin1
The ISO 8859-1 character set standard. It is an 8-bit extension to ASCII that adds 128 characters that include the most common Latin characters used in Western Europe. The Oracle character set name is WE8ISO8859P1.
locale
A collection of information about the linguistic and cultural preferences from a particular region. Typically, a locale consists of language, territory, character set, linguistic, and calendar information defined in NLS data files.
localization
The process of providing language-specific or culture-specific information for software systems. Translation of an application's user interface is an example of localization. Localization should not be confused with globalization, which is the process of making software suitable for different linguistic and cultural environments.
monolingual linguistic sort
An Oracle sort that has two levels of comparison for strings. Most European languages can be sorted with a monolingual sort, but it is inadequate for Asian languages.
multibyte
When character codes are assigned to all characters in a specific language or a group of languages, one byte (8 bits) can represent 256 different characters. Two bytes (16 bits) can represent up to 65,536 different characters. Two bytes are not enough to represent all the characters for many languages. Some characters require 3 or 4 bytes.
One example is the UTF8 Unicode encoding. In UTF8, there are many 2-byte and 3-byte characters.
Another example is Traditional Chinese, used in Taiwan. It has more than 80,000 characters. Some character encoding schemes that are used in Taiwan use 4 bytes to encode characters.
multibyte character
A character whose character code consists of two or more bytes under a certain character encoding scheme.
Note that the same character may have different character codes under different encoding schemes. Oracle cannot tell whether a character is a multibyte character without knowing which character encoding scheme is being used. For example, Japanese Hankaku-Katakana (half-width Katakana) characters are one byte in the JA16SJIS encoded character set, two bytes in JA16EUC, and three bytes in UTF8.
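The half-width Katakana example can be reproduced with a short sketch. It assumes that Python's `shift_jis` and `euc_jp` codecs are close enough to Oracle's JA16SJIS and JA16EUC for this one character, and that standard `utf-8` matches Oracle's UTF8 for characters outside the supplementary range.

```python
# HALFWIDTH KATAKANA LETTER A (U+FF71), written as an escape.
hankaku_a = "\uff71"

# Encoded length of the same character under three encoding schemes.
byte_lengths = {
    encoding: len(hankaku_a.encode(encoding))
    for encoding in ("shift_jis", "euc_jp", "utf-8")
}

def is_multibyte(ch, encoding):
    """A character is multibyte only relative to a given encoding."""
    return len(ch.encode(encoding)) >= 2
```

The same character is single-byte in one encoding and multibyte in the others, so the question "is this a multibyte character?" has no answer until the encoding is fixed.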
multibyte character string
A character string that consists of one of the following:
No characters (called a null string)
One or more single-byte characters
A mixture of one or more single-byte characters and one or more multibyte characters
One or more multibyte characters
multilingual linguistic sort
An Oracle sort that evaluates strings on three levels. Asian languages require a multilingual linguistic sort even if data exists in only one language. Multilingual linguistic sorts are also used when data exists in several languages.
national character set
An alternate character set from the database character set that can be specified for NCHAR, NVARCHAR2, and NCLOB columns. National character sets are in Unicode only.
NLB files
Binary files used by the Locale Builder to define locale-specific data. They define all of the locale definitions that are shipped with a specific release of the Oracle database server. You can create user-defined NLB files with Oracle Locale Builder.
NLS
National Language Support. NLS allows users to interact with the database in their native languages. It also allows applications to run in different linguistic and cultural environments. The term is somewhat obsolete because Oracle supports users of many languages and locales at the same time.
NLSRTL
National Language Support Runtime Library. This library is responsible for providing locale-independent algorithms for internationalization. The locale-specific information (that is, NLSDATA) is read by the NLSRTL library during run-time.
NLT files
Text files used by the Locale Builder to define locale-specific data. Because they are text files, you can view the contents.
Oracle Locale Builder
A GUI utility that offers a way to view, modify, or define locale-specific data. You can also create your own formats for language, territory, character set, and linguistic sort.
replacement character
A character used during character conversion when the source character is not available in the target character set. For example, ? is often used as Oracle's default replacement character.
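Replacement during conversion can be illustrated with Python's codec error handlers. This is only an analogy: the `errors="replace"` handler happens to substitute `?` when encoding, but it is Python's mechanism, not Oracle's conversion routine.

```python
# "naive cafe" with i-diaeresis and e-acute, written as escapes.
source_text = "na\u00efve caf\u00e9"

# The target character set (ASCII) lacks both accented letters, so each
# one is replaced during conversion and the original data is lost.
converted = source_text.encode("ascii", errors="replace")
```

Because the substitution is not reversible, converting the result back does not restore the original characters.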
restricted multilingual support
Multilingual support that is restricted to a group of related languages. Western European languages can be represented with ISO 8859-1, for example. If multilingual support is restricted in this way, then Thai cannot be added to the group.
SQL CHAR datatypes
Includes CHAR, VARCHAR, VARCHAR2, CLOB, and LONG datatypes.
SQL NCHAR datatypes
Includes NCHAR, NVARCHAR, NVARCHAR2, and NCLOB datatypes.
script
A collection of related graphic symbols that are used in a writing system. Some scripts can represent multiple languages, and some languages use multiple scripts. Examples of scripts include Latin, Arabic, and Han.
single byte
One byte. One byte usually consists of 8 bits. When character codes are assigned to all characters for a specific language, one byte (8 bits) can represent 256 different characters.
single-byte character
A single-byte character is a character whose character code consists of one byte under a specific character encoding scheme. Note that the same character may have different character codes under different encoding schemes. Oracle cannot tell which character is a single-byte character without knowing which encoding scheme is being used. For example, the euro currency symbol is one byte in the WE8MSWIN1252 encoded character set, two bytes in AL16UTF16, and three bytes in UTF8.
single-byte character string
A single-byte character string is a character string that consists of one of the following:
No characters (called a null string)
One or more single-byte characters
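The two definitions above can be sketched as a small check. The helper names are hypothetical, and Python's `cp1252`, `utf-16-be`, and `utf-8` codecs are assumed to correspond to WE8MSWIN1252, AL16UTF16, and UTF8 for the euro symbol.

```python
def encoded_width(ch, encoding):
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))

def is_single_byte_string(text, encoding):
    """True for a null string or a string of single-byte characters only."""
    return all(encoded_width(ch, encoding) == 1 for ch in text)

euro = "\u20ac"
euro_widths = [encoded_width(euro, enc) for enc in ("cp1252", "utf-16-be", "utf-8")]

price = "10 \u20ac"
single_in_cp1252 = is_single_byte_string(price, "cp1252")
single_in_utf8 = is_single_byte_string(price, "utf-8")
```

The same string qualifies as a single-byte character string in one character set and not in another.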
supplementary characters
The first version of Unicode was a 16-bit, fixed-width encoding that used two bytes to encode each character. This allowed 65,536 characters to be represented. However, more characters need to be supported because of the large number of Asian ideograms.
Unicode 3.1 defines supplementary characters to meet this need. It uses two 16-bit code units (also known as surrogate pairs) to represent a single character. This allows an additional 1,048,576 characters to be defined. The Unicode 3.1 standard added the first group of 44,944 supplementary characters.
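The surrogate pair arithmetic behind these numbers can be sketched as follows. The function name `to_surrogate_pair` is illustrative; the constants are the standard UTF-16 surrogate ranges.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into two 16-bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary character")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

# 10 bits in each surrogate give 2**20 additional characters.
supplementary_count = 0x10FFFF - 0x10000 + 1

pair = to_surrogate_pair(0x1F600)
# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001F600".encode("utf-16-be")
```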
syllabary
A syllabary provides a mechanism for communicating phonetic information along with the ideographic characters used by languages such as Japanese.
UCS-2
A 1993 ISO/IEC standard character set. It is a fixed-width, 16-bit Unicode character set. Each character occupies 16 bits of storage. The ISO Latin1 characters are the first 256 code points, so it can be viewed as a 16-bit extension of ISO Latin1.
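The claim that the ISO Latin1 characters occupy the first 256 code points can be checked with a few lines. This sketch relies on Python strings holding Unicode code points, which share their first 65,536 values with UCS-2.

```python
# Every possible ISO Latin1 byte value, 0x00 through 0xFF.
latin1_bytes = bytes(range(256))

# Decode as ISO 8859-1 and read back the code point of each character.
decoded = latin1_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

# The lower half is plain 7-bit ASCII.
ascii_half = latin1_bytes[:128].decode("ascii")
```

Each byte value maps to the code point with the same number, which is what makes the 16-bit set an extension of ISO Latin1.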
UCS-4
A fixed-width, 32-bit Unicode character set. Each character occupies 32 bits of storage. The UCS-2 characters are the first 65,536 code points in this standard, so it can be viewed as a 32-bit extension of UCS-2. This is also sometimes referred to as ISO-10646.
Unicode
Unicode is a universal encoded character set that allows information from any language to be stored by using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.
Unicode database
A database whose database character set is UTF-8.
Unicode code point
A value in the Unicode codespace, which ranges from 0 to 0x10FFFF. Unicode assigns a unique code point to every character.
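The bounds of the codespace can be illustrated briefly. The constant and helper names below are illustrative; `chr` and `ord` are Python built-ins that convert between code points and characters.

```python
MAX_CODE_POINT = 0x10FFFF

def is_code_point(value):
    """True if value lies inside the Unicode codespace."""
    return 0 <= value <= MAX_CODE_POINT

# chr accepts any value in the codespace and rejects anything above it.
highest = chr(MAX_CODE_POINT)
try:
    chr(MAX_CODE_POINT + 1)
    beyond_range_rejected = False
except ValueError:
    beyond_range_rejected = True
```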
Unicode datatype
A SQL NCHAR datatype (NCHAR, NVARCHAR2, and NCLOB). You can store Unicode characters in columns of these datatypes even if the database character set is not Unicode.
unrestricted multilingual support
The ability to use as many languages as desired. A universal character set, such as Unicode, helps to provide unrestricted multilingual support because it supports a very large character repertoire, encompassing most modern languages of the world.
UTFE
A Unicode 3.0 UTF-8 Oracle database character set with 6-byte supplementary character support. It is used only on EBCDIC platforms.
UTF8
The UTF8 Oracle character set encodes characters in one, two, or three bytes. It is for ASCII-based platforms. The UTF8 character set supports Unicode 3.0. Although specific supplementary characters were not assigned code points in Unicode until version 3.1, the code point range was allocated for supplementary characters in Unicode 3.0. Supplementary characters are treated as two separate, user-defined characters that occupy 6 bytes.
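The 6-byte figure follows from encoding each half of a surrogate pair as its own 3-byte sequence. The sketch below imitates that behavior under stated assumptions: the function name `encode_surrogates_separately` is hypothetical, Python's `surrogatepass` error handler is used to permit encoding lone surrogates, and the result is only meant to show the byte count, not to reproduce Oracle's implementation.

```python
def encode_surrogates_separately(text):
    """Encode each UTF-16 code unit as its own UTF-8-style sequence."""
    utf16 = text.encode("utf-16-be")
    out = b""
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # surrogatepass lets a lone surrogate be encoded in 3 bytes.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return out

supplementary = "\U0001F600"
six_byte_form = encode_surrogates_separately(supplementary)
standard_utf8 = supplementary.encode("utf-8")
```

Characters outside the supplementary range come out identical to standard UTF-8 under this scheme, so only supplementary characters differ.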
UTF-8
The 8-bit encoding of Unicode. It is a variable-width encoding. One Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes in UTF-8 encoding. Characters from the European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes.
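The four possible widths can be observed with one sample character per class. The sample characters are arbitrary choices for illustration.

```python
samples = {
    "ascii letter": "A",                 # U+0041
    "accented Latin letter": "\u00e9",   # U+00E9
    "Han ideograph": "\u4e2d",           # U+4E2D
    "supplementary": "\U0001F600",       # U+1F600
}

# Encoded length in bytes of each sample under UTF-8.
utf8_widths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
```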
UTF-16
The 16-bit encoding of Unicode. It is an extension of UCS-2 and supports the supplementary characters defined in Unicode 3.1 by using a pair of UCS-2 code points. One Unicode character can be 2 bytes or 4 bytes in UTF-16 encoding. Characters (including ASCII characters) from European scripts and most Asian scripts are represented in 2 bytes. Supplementary characters are represented in 4 bytes.
wide character
A fixed-width character format that is useful for extensive text processing because it allows data to be processed in consistent, fixed-width chunks. Wide characters are intended to support internal character processing.