Oracle® Database Concepts 10g Release 1 (10.1) Part Number B10743-01
This chapter provides an overview of Oracle's content management features.
This chapter contains the following topics:
Oracle Database includes datatypes to handle all the types of rich Internet content such as relational data, object-relational data, XML, text, audio, video, image, and spatial. These datatypes appear as native types in the database. They can all be queried using SQL. A single SQL statement can include data belonging to any or all of these datatypes.
As applications evolve to encompass increasingly rich semantics, they encounter the need to deal with the following kinds of data:
Simple structured data
Complex structured data
Semi-structured data
Unstructured data
Traditionally, the relational model has been very successful at dealing with simple structured data -- the kind which can fit into simple tables. Oracle added object-relational features so that applications can deal with complex structured data -- collections, references, user-defined types and so on. Queuing technologies, such as Oracle Streams Advanced Queuing, deal with messages and other semi-structured data. This chapter discusses Oracle's technologies to support unstructured data.
Unstructured data cannot be decomposed into standard components. Data about an employee can be 'structured' into a name (probably a character string), an identification (likely a number), a salary, and so on. But if you are given a photo, you find that the data really consists of a long stream of 0s and 1s. These 0s and 1s are used to switch pixels on or off, so that you see the photo on a display, but the data cannot be broken down into any finer structure in terms of database storage.
Unstructured data such as text, graphic images, still video clips, full motion video, and sound waveforms tends to be large -- a typical employee record may be a few hundred bytes, but even small amounts of multimedia data can be thousands of times larger. Some multimedia data may reside in operating system files, and it is desirable to access those files from the database.
Extensible Markup Language (XML) is a tag-based markup language that lets developers create their own tags to describe data that's exchanged between applications and systems over the Internet. XML is widely adopted as the common language of information exchange between companies. One reason for its popularity is its ease of use: XML documents and XML-based messages can be sent easily over the Internet using common protocols, such as HTTP or FTP.
Oracle XML DB treats XML as a native datatype in the database. Oracle XML DB is not a separate server. The XML data model encompasses both unstructured content and structured data. Applications can use standard SQL and XML operators to generate complex XML documents from SQL queries and to store XML documents.
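The idea of generating an XML document from the result of a SQL query can be sketched outside Oracle. The following is a minimal illustration only: it uses Python's standard `sqlite3` and `xml.etree.ElementTree` modules as stand-ins, and the `employees` table, the `rows_to_xml` helper, and the `rowset`/`row` tag names are all assumptions made for the example, not part of Oracle XML DB.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, query, root_tag="rowset", row_tag="row"):
    """Run a query and wrap every result row in an XML element.

    Hypothetical helper: column names become child element names.
    """
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    root = ET.Element(root_tag)
    for record in cursor:
        row = ET.SubElement(root, row_tag)
        for name, value in zip(columns, record):
            ET.SubElement(row, name).text = "" if value is None else str(value)
    return root


# Illustrative in-memory table (not an Oracle schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ada", 5200.0), (2, "Grace", 6100.0)],
)

doc = rows_to_xml(conn, "SELECT id, name, salary FROM employees ORDER BY id")
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

In Oracle Database itself the same result would be produced inside a single SQL statement rather than in client code; the sketch only shows the shape of the mapping from rows and columns to elements.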
Oracle XML DB provides new capabilities for both content-oriented and data-oriented access. For developers who see XML as documents (news stories, articles, and so on), Oracle XML DB provides an XML repository accessible from standard protocols and SQL.
For others, the structured-data aspect of XML (invoices, addresses, and so on) is more important. For these users, Oracle XML DB provides a native XMLType and support for XML Schema, XPath, XSLT, DOM, and so on. Data-oriented access is typically more query-intensive.
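To make the data-oriented view concrete, the sketch below queries a small invoice document with path expressions. It is an assumption-laden illustration: the invoice markup is invented, and Python's `xml.etree.ElementTree` supports only a limited subset of XPath, so this is not a demonstration of XMLType or of Oracle's XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice document used only for this example.
INVOICE_XML = """
<invoice number="1001">
  <customer>Acme Corp</customer>
  <item sku="A-1"><qty>2</qty><price>9.50</price></item>
  <item sku="B-7"><qty>1</qty><price>20.00</price></item>
</invoice>
"""

invoice = ET.fromstring(INVOICE_XML)

# Path expressions select structured values out of the document.
customer = invoice.findtext("customer")
skus = [item.get("sku") for item in invoice.findall("item")]
price_of_b7 = invoice.find("item[@sku='B-7']/price").text

# Treat the selected values as ordinary data: compute the invoice total.
total = sum(
    int(item.findtext("qty")) * float(item.findtext("price"))
    for item in invoice.findall("item")
)

print(customer, skus, price_of_b7, total)
```

The point of the example is that once XML carries structured data, queries resemble relational queries: select by attribute, project values, and aggregate.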
The Oracle XML developer's kits (XDK) contain the basic building blocks for reading, manipulating, transforming, and viewing XML documents. They are available for Java, JavaBeans, C, C++, and PL/SQL. Unlike many shareware and trial XML components, the production Oracle XDKs are fully supported and come with a commercial redistribution license. Oracle XDKs consist of the following components:
XML Parsers: supporting Java, C, C++, and PL/SQL, the components create and parse XML using industry standard DOM and SAX interfaces.
XSLT Processor: transforms or renders XML into other text-based formats, such as HTML.
XML Schema Processor: supporting Java, C, and C++, allows use of XML simple and complex datatypes.
XML Class Generator: automatically generates Java and C++ classes from DTDs and schemas to send XML data from Web forms or applications.
XML Transviewer Java Beans: visually view and transform XML documents and data with Java components.
XML SQL Utility: supporting Java, generates XML documents, DTDs, and schemas from SQL queries.
XSQL Servlet: combines XML, SQL, and XSLT in the server to deliver dynamic Web content.
The large object (LOB) datatypes BLOB, CLOB, NCLOB, and BFILE enable you to store and manipulate large blocks of unstructured data (such as text, graphic images, video clips, and sound waveforms) in binary or character format. They provide efficient, random, piece-wise access to the data.
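Random, piece-wise access means reading a slice of a large value by offset and length without fetching the rest. The sketch below illustrates the access pattern on an ordinary operating system file, which is loosely analogous to data held outside the database; the `read_piece` helper and the sample file are assumptions for the example and do not use any Oracle LOB interface.

```python
import os
import tempfile


def read_piece(path, offset, length):
    """Return `length` bytes starting at byte `offset`, skipping the rest."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


# Build a 1 MiB sample file whose byte at position p has the value p % 256.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    sample_path = tmp.name

size = os.path.getsize(sample_path)
piece = read_piece(sample_path, offset=1000, length=4)
os.remove(sample_path)

print(size, list(piece))  # 1048576 [232, 233, 234, 235]
```

Only four bytes are read from the one-megabyte file, which is the property that makes this style of access practical for large multimedia values.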
With the growth of the internet and content-rich applications, it has become imperative that the database support a datatype that fulfills the following:
Can store unstructured data
Is optimized for large amounts of such data
Provides a uniform way of accessing large unstructured data within the database or outside
Oracle Text indexes any document or textual content to add fast, accurate retrieval of information to internet content management applications, e-Business catalogs, news services, job postings, and so on. It can index content stored in file systems, databases, or on the Web.
Oracle Text allows text searches to be combined with regular database searches in a single SQL statement. It can find documents based on their textual content, metadata, or attributes. The Oracle Text SQL API makes it simple and intuitive to create and maintain Text indexes and run Text searches.
Oracle Text is completely integrated with the Oracle database, making it inherently fast and scalable. The Text index is in the database, and Text queries are run in the Oracle process. The Oracle optimizer can choose the best execution plan for any query, giving the best performance for ad hoc queries involving Text and structured criteria. Additional advantages include the following:
Oracle Text supports multilingual querying and indexing.
You can index and define sections for searching in XML documents. Section searching lets you narrow down queries to blocks of text within documents. Oracle Text can automatically create XML sections for you.
A Text index can span many Text columns, giving the best performance for Text queries across more than one column.
Oracle Text has enhanced performance for operations that are common in Text searching, like count hits.
Oracle Text leverages scalability features, such as replication.
Oracle Text supports local partitioned indexes.
There are three Text index types to cover all text search needs.
Standard index type for traditional full-text retrieval over documents and Web pages. The context index type provides a rich set of text search capabilities for finding the content you need, without returning pages of spurious results.
Catalog index type, designed specifically for e-Business catalogs. This catalog index provides flexible searching and sorting at Web-speed.
Classification index type for building classification or routing applications. This index is created on a table of queries, where the queries define the classification or routing criteria.
Oracle Text also provides substring and prefix indexes. Substring indexing improves performance for left-truncated or double-truncated wildcard queries. Prefix indexing improves performance for right-truncated wildcard queries.
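Why a prefix index helps a right-truncated wildcard query can be shown with a toy structure: if every short prefix of every token is stored ahead of time, a pattern such as `dat%` becomes a dictionary lookup instead of a scan over all tokens. This is a conceptual sketch only; the function names, the `%` wildcard handling, and the prefix length limits are assumptions, and it does not describe how Oracle Text stores its indexes.

```python
from collections import defaultdict


def build_prefix_index(tokens, min_len=1, max_len=4):
    """Map each prefix of length min_len..max_len to the tokens that start with it."""
    index = defaultdict(set)
    for token in set(tokens):
        for n in range(min_len, min(max_len, len(token)) + 1):
            index[token[:n]].add(token)
    return index


def expand_right_truncated(index, pattern, max_len=4):
    """Expand a pattern such as 'dat%' to the matching indexed tokens."""
    stem = pattern.rstrip("%")
    # Look up the longest indexed prefix, then filter for longer stems.
    candidates = index.get(stem[:max_len], set())
    return sorted(token for token in candidates if token.startswith(stem))


vocabulary = ["database", "datatype", "data", "date", "index", "indexing", "text"]
prefix_index = build_prefix_index(vocabulary)

print(expand_right_truncated(prefix_index, "dat%"))
# ['data', 'database', 'datatype', 'date']
print(expand_right_truncated(prefix_index, "datab%"))
# ['database']
```

The trade-off is the usual one for indexes: extra storage and maintenance for the prefix entries in exchange for faster wildcard expansion at query time.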
Oracle Text provides a number of utilities to view text, no matter how that text is stored.
Oracle Text supports over 150 document formats through its Inso filtering technology, including all common document formats like XML, PDF, and MS Office. You can also create your own custom filter.
You can view the HTML version of any text, including formatted documents such as PDF, MS Office, and so on.
You can view the HTML version of any text, with search terms highlighted and with navigation to next/previous term in the text.
Oracle Text provides markup information, such as the offset and length of each search term in the text, for use by, for example, a third-party viewer.
The CTX_QUERY PL/SQL package can be used to generate query feedback, count hits, and create stored query expressions.
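The markup information mentioned above (an offset and a length for each search term) is enough for a viewer to build a highlighted HTML rendering. The following sketch shows that relationship using Python's `re` and `html` modules; the function names and the `<b>` highlight tag are assumptions for illustration, and no Oracle Text package is being called.

```python
import html
import re


def find_term_offsets(text, terms):
    """Return (offset, length) for each whole-word, case-insensitive term hit."""
    if not terms:
        return []
    pattern = re.compile(
        "|".join(r"\b%s\b" % re.escape(term) for term in terms),
        re.IGNORECASE,
    )
    return [(match.start(), len(match.group())) for match in pattern.finditer(text)]


def highlight_html(text, terms):
    """Build an HTML rendering with each hit wrapped in <b>...</b>."""
    pieces, cursor = [], 0
    for offset, length in find_term_offsets(text, terms):
        pieces.append(html.escape(text[cursor:offset]))
        pieces.append("<b>%s</b>" % html.escape(text[offset:offset + length]))
        cursor = offset + length
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)


sample = "Oracle Text indexes text & documents."
print(find_term_offsets(sample, ["text"]))
# [(7, 4), (20, 4)]
print(highlight_html(sample, ["text"]))
# Oracle <b>Text</b> indexes <b>text</b> &amp; documents.
```

Keeping the offsets separate from the rendering is what lets a third-party viewer apply its own highlighting or next/previous navigation.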
With Oracle Text, you can find, classify, and cluster documents based on their text, metadata, or attributes.
Document classification performs an action based on document content. Actions include assigning category IDs to a document for future lookup or sending the document to a user. The result is a set, or stream, of categorized documents. For example, assume that there is an incoming stream of news articles. You can define a rule to represent the category of Finance. The rule is essentially one or more queries that select documents about the subject of finance. The rule might have the form 'stocks or bonds or earnings.' When a document arrives that satisfies the rules for this category, the application takes an action, such as tagging the document as Finance or emailing one or more users.
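The Finance example can be sketched as a small rule table applied to an incoming stream. This is an illustrative simplification: each rule is reduced to a set of terms joined by "or", the `Sports` rule and the sample articles are invented, and real rules in Oracle Text are stored queries evaluated through a classification index rather than Python sets.

```python
import re

# Each category is a rule of the form "term or term or term".
RULES = {
    "Finance": {"stocks", "bonds", "earnings"},
    "Sports": {"match", "league", "tournament"},  # hypothetical second rule
}


def classify(document, rules=RULES):
    """Return the sorted categories whose rule matches at least one word."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return sorted(category for category, terms in rules.items() if words & terms)


def route(documents, rules=RULES):
    """Tag every document in an incoming batch with its categories."""
    return {doc_id: classify(text, rules) for doc_id, text in documents.items()}


incoming = {
    1: "Bonds rallied as quarterly earnings beat forecasts.",
    2: "The league announced a new tournament format.",
    3: "Rain is expected over the weekend.",
}

print(route(incoming))
# {1: ['Finance'], 2: ['Sports'], 3: []}
```

The tagging step is where an application would attach its action, such as storing the category ID with the document or notifying subscribed users.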
Clustering is the unsupervised division of patterns into groups. The interface lets users select the appropriate clustering algorithm. Each cluster contains a subset of the documents in the collection. A document within a cluster is believed to be more similar to the other documents inside the cluster than to documents outside it. Clusters can be used to build features such as presenting similar documents in the collection.
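As a rough illustration of unsupervised grouping, the sketch below assigns each document to the first cluster whose leader it resembles, measuring resemblance by word overlap (Jaccard similarity). The leader algorithm, the threshold, and the sample documents are assumptions chosen for brevity; they are not the clustering algorithms that Oracle Text offers.

```python
import re


def tokens(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_cluster(documents, threshold=0.2):
    """Single-pass clustering: join the first similar leader, else start a cluster."""
    clusters = []  # list of (leader_tokens, member_indexes)
    for position, text in enumerate(documents):
        doc_tokens = tokens(text)
        for leader_tokens, members in clusters:
            if jaccard(doc_tokens, leader_tokens) >= threshold:
                members.append(position)
                break
        else:
            clusters.append((doc_tokens, [position]))
    return [members for _, members in clusters]


collection = [
    "stocks and bonds rose today",
    "bonds and stocks fell today",
    "the team won the league match",
    "league match ends in a draw",
]

print(leader_cluster(collection))
# [[0, 1], [2, 3]]
```

No categories are supplied in advance; the groups emerge from the similarity measure alone, which is the difference between clustering and the rule-based classification described earlier.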
Oracle Ultra Search is built on the Oracle database server and Oracle Text technology, and provides uniform search-and-locate capabilities over multiple repositories: Oracle databases, other ODBC compliant databases, IMAP mail servers, HTML documents served up by a Web server, files on disk, and more.
Ultra Search uses a 'crawler' to index documents; the documents stay in their own repositories, and the crawled information is used to build an index that stays within your firewall in a designated Oracle database. Ultra Search also provides APIs for building content management solutions.
Ultra Search offers the following:
A complete text query language for text search inside the database
Full integration with the Oracle database server and the SQL query language
Advanced features like concept searching and theme analysis
Indexing of all common file formats (150+)
Full globalization, including support for Chinese, Japanese and Korean (CJK), and Unicode
Oracle interMedia provides an array of services to develop and deploy traditional, Web, and wireless applications that include rich media. Multimedia content can be stored and managed directly in Oracle, or Oracle can store and index metadata together with external references that enable efficient access to media content stored outside the database.
Oracle interMedia services include the following:
Parse, index, and store rich content using new or existing database schemas
Develop content-rich Web applications
Deploy rich content on the Web
Use standard Oracle Database features to create scalable, manageable media content repositories
Oracle interMedia provides a number of load mechanisms ranging from low volume graphical user interface load utilities, through programmatic load APIs, to bulk media loaders. At load time, interMedia can extract the rich metadata that accompanies the media and use Oracle Text's text indexing and retrieval capabilities to build indexes for query and retrieval of the rich media content based upon the metadata.
Oracle interMedia allows for access to image, audio, and video data in most common Internet formats from a variety of sources, both within Oracle Database and from external locations, such as Web URL sites or specialized servers.
interMedia supports delivery of video through streaming servers such as the RealNetworks RealAudio and RealVideo Servers. interMedia supports drag and drop of audio, video, and image data through the interMedia clipboard into Web applications such as Oracle Application Server Portal and popular Web authoring tools. interMedia also supports efficient development of media-rich, Java-based Internet applications through Oracle JDeveloper and dynamic Web page composition through MacroMedia's Ultradev.
A common example of spatial data can be seen in a road map. A road map is a two-dimensional object that contains points, lines, and polygons that can represent cities, roads, and political boundaries such as states or provinces. A road map is a visualization of geographic information. The locations of cities, roads, and political boundaries that exist on the surface of the Earth are projected onto a two-dimensional display or piece of paper, preserving the relative positions and relative distances of the rendered objects.
The data that indicates the Earth location (latitude and longitude, or height and depth) of these rendered objects is the spatial data. When the map is rendered, this spatial data is used to project the locations of the objects on a two-dimensional piece of paper. A GIS is often used to store, retrieve, and render this Earth-relative spatial data.
Types of spatial data that can be stored using Spatial other than GIS data include data from computer-aided design (CAD) and computer-aided manufacturing (CAM) systems. Instead of operating on objects on a geographic scale, CAD/CAM systems work on a smaller scale, such as for an automobile engine or printed circuit boards.
The differences among these systems are only in the relative sizes of the data, not the data's complexity. The systems might all actually involve the same number of data points. On a geographic scale, the location of a bridge can vary by a few tenths of an inch without causing any noticeable problems to the road builders, whereas if the diameter of an engine's pistons is off by a few tenths of an inch, the engine will not run. A printed circuit board is likely to have many thousands of objects etched on its surface that are no bigger than the smallest detail shown on a road builder's blueprints.
These applications all store, retrieve, update, or query some collection of features that have both nonspatial and spatial attributes. Examples of nonspatial attributes are name, soil_type, landuse_classification, and part_number. The spatial attribute is a coordinate geometry, or vector-based representation of the shape of the feature.
Oracle Spatial provides a SQL schema and functions that facilitate the storage, retrieval, update, and query of collections of spatial features in an Oracle Database. Spatial consists of the following components:
A schema (MDSYS) that prescribes the storage, syntax, and semantics of supported geometric datatypes
A spatial indexing mechanism
A set of operators and functions for performing area-of-interest queries, spatial join queries, and other spatial analysis operations
Administrative utilities
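An area-of-interest query over features that carry both nonspatial and spatial attributes can be sketched as follows. The `Feature` class, the sample features, and the bounding-box test are assumptions made for illustration: the example does a linear scan with a coarse rectangle comparison, whereas Oracle Spatial uses its own geometry datatypes, a spatial index, and exact geometric operators.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str                    # nonspatial attribute
    landuse_classification: str  # nonspatial attribute
    geometry: list               # spatial attribute: [(x, y), ...] vertices


def bounding_box(geometry):
    """Smallest axis-aligned rectangle (min_x, min_y, max_x, max_y) around the vertices."""
    xs = [x for x, _ in geometry]
    ys = [y for _, y in geometry]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_intersect(a, b):
    """True when two (min_x, min_y, max_x, max_y) rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def area_of_interest(features, window):
    """Names of features whose bounding box intersects the query window."""
    return [f.name for f in features if boxes_intersect(bounding_box(f.geometry), window)]


features = [
    Feature("park", "recreation", [(0, 0), (4, 0), (4, 3), (0, 3)]),  # polygon
    Feature("depot", "industrial", [(10, 10)]),                        # point
    Feature("road", "transport", [(3, 2), (12, 2)]),                   # line
]

print(area_of_interest(features, window=(2, 1, 6, 5)))
# ['park', 'road']
```

A bounding-box comparison only identifies candidates; a production system would follow it with an exact geometry test, and would use a spatial index so that the candidate search does not have to visit every feature.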