陈俊帆
Paper
Open-Set Semi-Supervised Text Classification with Latent Outlier Softening
Posted: 2025-10-22
Venue: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), CCF-A
Abstract: Semi-supervised text classification (STC) has been studied extensively as a way to reduce human annotation effort. However, the assumption in existing work that unlabeled data contains only in-distribution texts is unrealistic. This paper extends STC to a more practical Open-set Semi-supervised Text Classification (OSTC) setting, in which the unlabeled data may contain out-of-distribution (OOD) texts. The main challenge in OSTC is the false-positive inference problem caused by inadvertently including OOD texts during training. To address this problem, we first develop baseline models that use outlier detectors for hard OOD-data filtering in a pipeline procedure. We then propose a Latent Outlier Softening (LOS) framework that integrates semi-supervised training and outlier detection within probabilistic latent-variable modeling. LOS softens the impact of OOD texts through the Expectation-Maximization (EM) algorithm and weighted entropy maximization. Experiments on three newly constructed datasets show that LOS significantly outperforms the baselines.
Co-authors: 陈俊帆, 张日崇, Junchi Chen, 胡春明, Yongyi Mao
Paper type: International academic conference
Pages: 226-236
Translation:
Publication date: 2023-01-01
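The abstract's contrast between hard OOD filtering and latent outlier softening can be illustrated with a minimal sketch. Everything below (function names, the form of the per-example weight) is an assumption for illustration only, not the paper's LOS implementation: the idea shown is that each unlabeled text receives a soft in-distribution weight that scales its contribution to an entropy term, instead of the text being kept or discarded outright.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over class logits.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_entropy_loss(logits, id_weights):
    """Per-example prediction entropy, scaled by a soft weight
    (e.g., an estimated probability of being in-distribution).
    A hypothetical illustration of soft outlier weighting,
    not the LOS objective from the paper."""
    p = softmax(logits)
    ent = -(p * np.log(p + 1e-12)).sum(axis=1)  # entropy per example
    return float((id_weights * ent).mean())
```

With a weight of 0, an example contributes nothing, which corresponds to hard filtering; intermediate weights soften its influence rather than making a binary keep/discard decision.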