Date of Award

12-2022

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Engineering and Sciences

First Advisor

Philip Chan

Second Advisor

Georgios Anagnostopoulos

Third Advisor

Debasis Mitra

Fourth Advisor

Marius C. Silaghi

Abstract

Machine learning models have achieved great success in many research and industry fields, but that success relies heavily on massive data collection and human annotation. Because the real world is an open set, categories that emerge daily and the lack of annotations for them pose new challenges for these models. Open Set Recognition (OSR) addresses the case in which newly emerged categories are absent from the training samples. Given the newly emerged samples, the process of automatically identifying the novel categories is called Novel Category Discovery (NCD). In this dissertation, we focus on learning representations for OSR and NCD. To learn representations for OSR, we first introduce an extension called Min Max Feature (MMF) that can be incorporated into different loss functions to find more discriminative representations. Our evaluation shows that the proposed extension can significantly improve the OSR performance of different types of loss functions. Then, we propose a self-supervision method, Detransformation Autoencoder (DTAE), for the OSR problem on image datasets. This method learns representations that are invariant to transformations of the input data. Next, to extend DTAE to graph datasets, we present two transformations (FCG-shift and FCG-random) for Function Call Graph (FCG) based malware representations to facilitate the pretext task. The experimental results indicate that our proposed pre-training process can improve the performance of different downstream loss functions for the OSR problem on both image and graph datasets. To tackle the problem of NCD under an open-set scenario, we propose the General Intra-Inter (GII) loss to learn a representation space that clusters the unlabeled samples as novel categories while maintaining sensitivity to the unknown category. Our evaluation on image and graph datasets shows that GII outperforms other approaches in NCD and OSR.
