Journal:Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), CCF-A
Abstract:Semi-supervised text classification (STC) has been extensively studied and reduces the cost of human annotation. However, existing research unrealistically assumes that unlabeled data contains only in-distribution texts. This paper extends STC to a more practical Open-set Semi-supervised Text Classification (OSTC) setting, in which the unlabeled data may contain out-of-distribution (OOD) texts. The main challenge in OSTC is the false-positive inference problem caused by inadvertently including OOD texts during training. To address this problem, we first develop baseline models that use outlier detectors for hard OOD-data filtering in a pipeline procedure. We then propose a Latent Outlier Softening (LOS) framework that integrates semi-supervised training and outlier detection within probabilistic latent-variable modeling. LOS softens the impact of OOD texts through the Expectation-Maximization (EM) algorithm and weighted entropy maximization. Experiments on three newly constructed datasets show that LOS significantly outperforms the baselines.
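The weighted entropy maximization mentioned in the abstract admits a simple reading: unlabeled texts judged likely to be OOD are pushed toward uniform, high-entropy predictions instead of being hard-filtered. The PyTorch sketch below illustrates such a term under that assumption; the function name and the `ood_weight` input are hypothetical stand-ins for the per-example posterior weights that LOS would obtain from its EM procedure, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def weighted_entropy_term(logits: torch.Tensor, ood_weight: torch.Tensor) -> torch.Tensor:
    """Hypothetical weighted entropy-maximization term for unlabeled texts.

    logits:     (batch, num_classes) classifier outputs on unlabeled texts.
    ood_weight: (batch,) soft probability that each text is OOD; in LOS this
                would come from the latent-variable posterior (assumed here).

    Returns a scalar to *minimize*: the negative weighted entropy, so that
    minimizing it raises prediction entropy on likely-OOD texts, softening
    their influence rather than filtering them out.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # per-example prediction entropy
    return -(ood_weight * entropy).mean()       # maximize entropy where OOD weight is high
```

In training, a term like this would presumably be added to the supervised cross-entropy loss with a tuning coefficient, so in-distribution unlabeled texts (low `ood_weight`) remain dominated by the semi-supervised objective.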
Co-author:Junfan Chen, Richong Zhang, Junchi Chen, Chunming Hu, Yongyi Mao
Indexed by:International academic conference
Page Number:226-236
Translation or Not:no
Date of Publication:2023-01-01
