Journal:Proceedings of the ACM on Web Conference 2024 (WWW), CCF-A
Abstract:Text classification is a vital tool for web content mining. Semi-supervised text classification (SSTC) alleviates the burden of annotation costs by training on a few labeled texts alongside many unlabeled texts. Two unsolved challenges in SSTC are the overfitting problem caused by the limited labeled data and the mislabeling problem of unlabeled texts. To address these issues, this paper proposes a Self-Paced Pair-Wise representation learning (SPPW) model. Concretely, SPPW alleviates the overfitting problem by replacing the overfitting-prone learning of a parameterized classifier with representation learning in a pair-wise manner. In addition, we propose a novel self-paced text filtering method that integrates both label confidence and text hardness to reduce mislabeled texts synergistically. Extensive experiments on three benchmark SSTC datasets show that SPPW outperforms baselines and is effective in mitigating the overfitting and mislabeling problems.
Co-author:Junfan Chen, Richong Zhang, Jiarui Wang, Chunming Hu, Yongyi Mao
Indexed by:International academic conference
Page Number:4352-4361
Translation or Not:no
Date of Publication:2024-01-01
