Theses and Dissertations

Comparing the Effect of Smoothing and N-gram Order : Finding the Best Way to Combine the Smoothing and Order of N-gram

Wenyang Zhang, Florida Institute of Technology

Date of Award

3-2015

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Engineering and Sciences

First Advisor

Veton Kepuska

Second Advisor

Samuel Kozaitis

Third Advisor

Carlos Otero

Fourth Advisor

Eraldo Ribeiro

Abstract

SRILM is a toolkit for building and applying statistical language models (LMs), designed and developed primarily for use in speech recognition, statistical tagging and segmentation, and machine translation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995, and it has benefited greatly from its use and enhancement during the Johns Hopkins University/CLSP summer workshops in 1995, 1996, 1997, and 2002. In this thesis, I study how the smoothing method and the N-gram order affect language models built with the SRILM toolkit. My primary method is comparison. First, I download a training corpus and a testing corpus from the web. Then, from the command line, I use the training corpus to train language models with different smoothing methods and N-gram orders, and I evaluate each model on the testing corpus. This yields a perplexity for each model, which serves as the measure of its quality. I list the perplexities and compare them across smoothing methods and N-gram orders to find the model with the lowest perplexity, which I take to be the best one. I then repeat the experiment with two other corpora, one for training and one for testing, to see how the choice of corpus affects the language model. If the two sets of perplexities are the same, the choice of corpus does not affect perplexity; otherwise, it does. In summary, my approach is to calculate the perplexity of each language model under different smoothing methods and N-gram orders and to compare the results to find the best combination of smoothing and N-gram order. At the same time, the comparison shows the effect of different corpora on language models with the same smoothing and N-gram order.
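The comparison described in the abstract can be illustrated with a small, self-contained sketch. This is not the thesis's procedure: the thesis uses the SRILM command-line tools and their smoothing methods, whereas the code below swaps in simple add-k smoothing and a toy corpus so that it runs without SRILM. The function names, the corpus sentences, and the values of k are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product


def ngrams(tokens, n):
    """Yield (context, word) pairs, padding with n-1 start symbols."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    for i in range(n - 1, len(padded)):
        yield tuple(padded[i - n + 1:i]), padded[i]


def train(sentences, n):
    """Count n-grams and their contexts; collect the vocabulary."""
    ngram_counts, context_counts, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = sentence.split()
        vocab.update(tokens)
        for context, word in ngrams(tokens, n):
            ngram_counts[(context, word)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts, vocab


def perplexity(model, sentences, n, k):
    """Perplexity under add-k smoothing (a stand-in, not an SRILM method)."""
    ngram_counts, context_counts, vocab = model
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    log_prob, num_tokens = 0.0, 0
    for sentence in sentences:
        for context, word in ngrams(sentence.split(), n):
            numerator = ngram_counts[(context, word)] + k
            denominator = context_counts[context] + k * vocab_size
            log_prob += math.log(numerator / denominator)
            num_tokens += 1
    return math.exp(-log_prob / num_tokens)


# Hypothetical toy corpora standing in for the downloaded ones.
train_corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat saw the dog",
]
test_corpus = ["the cat sat on the rug", "a dog saw the cat"]

# One perplexity per (order, smoothing strength) combination.
results = {}
for n, k in product((1, 2, 3), (1.0, 0.1, 0.01)):
    results[(n, k)] = perplexity(train(train_corpus, n), test_corpus, n, k)

best = min(results, key=results.get)
for (n, k), ppl in sorted(results.items()):
    print(f"order={n} k={k:<5} perplexity={ppl:.2f}")
print("lowest perplexity at (order, k):", best)
```

Repeating the loop with a second pair of corpora and comparing the two result tables would mirror the second experiment in the abstract. With SRILM itself, the training and evaluation steps are handled by its `ngram-count` and `ngram` tools rather than by code like this.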

Recommended Citation

Zhang, Wenyang, "Comparing the Effect of Smoothing and N-gram Order : Finding the Best Way to Combine the Smoothing and Order of N-gram" (2015). Theses and Dissertations. 702.
https://repository.fit.edu/etd/702

Included in

Computer Engineering Commons
