Date of Award

5-2020

Document Type

Dissertation

Degree Name

Master of Science (MS)

Department

Computer Engineering and Sciences

First Advisor

Eraldo Ribeiro

Second Advisor

Ronaldo Menezes

Third Advisor

Susan Earles

Fourth Advisor

Veton Kepuska

Abstract

Natural language processing (NLP) techniques have advanced considerably in recent years, and linguists and scientists have used them to address many challenges related to written language and literature. Problems such as finding the genetic relationships among languages, attributing the authorship of a text, and categorizing texts by genre have traditionally been treated with conventional statistical methods, for instance bag of words (BoW), N-grams, word frequencies, and the lexical distance between words. By considering written language as a complex system, network science tools and techniques can be used to address those problems. This dissertation proposes a unified methodology for doing so: (i) propose a framework for characterizing written language as a complex system; (ii) define three language-related fields to be addressed by the proposed methodology; and (iii) for each field, review the related literature to establish a solid background on the subject, collect and process the data and construct the networks, extract network measures and statistics to build the dataset, deploy machine learning algorithms to cluster and classify the datasets, and compare the results with those obtained from traditional methods.
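As an illustration of the "construct the networks, extract network measures" step named in the abstract, the following is a minimal sketch only. It assumes a word-adjacency network in which two words are linked when they appear next to each other in the text; the abstract does not state which network model or which measures the dissertation uses, so the function names and the chosen features (node count, edge count, average degree, average clustering coefficient) are hypothetical.

```python
import re
from collections import defaultdict


def build_cooccurrence_network(text):
    """Link each pair of adjacent words (assumed model, not the author's).

    Returns an undirected graph as {word: set_of_neighbouring_words}.
    """
    words = re.findall(r"[a-z']+", text.lower())
    graph = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:  # skip self-loops from immediate repetitions
            graph[left].add(right)
            graph[right].add(left)
    return dict(graph)


def local_clustering(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in graph[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))


def network_features(graph):
    """Summarise one text's network as a feature dictionary (hypothetical set)."""
    n = len(graph)
    m = sum(len(nbrs) for nbrs in graph.values()) // 2  # each edge counted twice
    return {
        "nodes": n,
        "edges": m,
        "avg_degree": 2.0 * m / n if n else 0.0,
        "avg_clustering": (
            sum(local_clustering(graph, v) for v in graph) / n if n else 0.0
        ),
    }


if __name__ == "__main__":
    sample = "the cat sat on the mat the cat ran"
    print(network_features(build_cooccurrence_network(sample)))
```

Computing one such feature dictionary per text would yield a table of rows that clustering or classification algorithms could then consume, which is the role the abstract assigns to the machine learning step.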
