Date of Award

12-2001

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Engineering and Sciences

First Advisor

Ryan Stansifer

Second Advisor

Phil Bernhard

Third Advisor

James Whittaker

Fourth Advisor

Gary Howell

Abstract

Recent and continuing rapid increases in computing power now enable more of humankind's written communication to be represented as digital data. The most recent and obvious change in multilingual information processing has been the introduction of larger character sets encompassing more writing systems. Yet the very richness of these larger collections of characters has made the interpretation and processing of text more difficult. The many competing motivations for standardizing character sets (satisfying the needs of linguists, computer scientists, and typographers) threaten the purpose of information processing: the accurate and facile manipulation of data. Existing character sets are constructed without a consistent strategy or architecture. Complex algorithms and reports are now necessary to interpret raw streams of characters representing multilingual text. We assert that information processing is an architectural problem, not just a character set problem. We analyze several multilingual information processing algorithms (e.g., bidirectional reordering and character normalization) and conclude that they are more dangerous than beneficial. The countless unexpected interactions among them suggest the lack of a coherent architecture. We introduce abstractions and novel mechanisms, and take the first steps toward organizing them into a new architecture for multilingual information processing. We propose a multilayered architecture, which we call Metacode, in which character sets appear in lower layers and protocols and algorithms in higher layers. We recast bidirectional reordering and character normalization in the Metacode framework.
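As a small illustration of the interpretation problem the abstract alludes to (this sketch is not part of the dissertation), the following Python snippet, using only the standard unicodedata module, shows that two different raw code-point sequences can represent the same visible text and only compare equal after the Unicode normalization algorithm is applied.

```python
# Illustrative sketch, not from the dissertation: two code-point sequences
# that render identically but differ as raw data, and are reconciled only
# by applying Unicode normalization (here, form NFC).
import unicodedata

precomposed = "\u00E9"    # 'e with acute' as a single code point (U+00E9)
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

print(precomposed == decomposed)          # False: the raw streams differ
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))  # True: equal after normalization
```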
