Kim, E., J. Moon, J. Shim, and E. Hwang. 2024. Predicting invasive species distributions using incremental ensemble-based pseudo-labeling. Ecological Informatics 79: 102407. https://doi.org/10.1016/j.ecoinf.2023.102407
Control strategies for preventing the spread of invasive species require accurate knowledge of their geographical distribution. Species distribution models (SDMs), which predict potential habitats of invasive species and thereby derive habitat suitability maps, have become a valuable tool for supporting regulatory strategies. To date, machine learning (ML)-based approaches have outperformed profile-based and statistical approaches in terms of species distribution prediction accuracy. However, ML-based approaches often suffer from poor predictive performance when labeled data are insufficient. Recently, pseudo-labeling (PL), a semi-supervised learning method, has been proposed to alleviate this problem, but pseudo-labels generated by a single teacher model trained on very few labeled data points are generally biased and not suitable for training SDMs. In this paper, we propose a novel prediction scheme for invasive species distributions using incremental ensemble-based PL (SDP-EPL). We first build an ensemble-based teacher using multiple conventional SDMs and then incrementally construct a training set, starting with very few labeled data points, by repeating the following process: (i) generating pseudo-labels for unlabeled data using the teacher model, (ii) appending the pseudo-labeled data representing high or low habitat suitability to the training set, and (iii) retraining the teacher model on the updated training set. We then train a student SDM on the resulting training set. Based on extensive experiments using citizen science datasets for three species, we show that the proposed scheme outperforms other commonly used SDMs across diverse evaluation metrics and achieves performance improvements of up to 14.61% and 5.45% compared to the baseline and state-of-the-art models, respectively. We also demonstrate the effectiveness of the ensemble teacher model and incremental labeling in terms of predictive accuracy.
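The three-step loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the two ensemble members (a centroid scorer and a k-nearest-neighbor scorer) are toy stand-ins for conventional SDMs, the suitability thresholds `hi=0.8` and `lo=0.2`, the logistic-regression student, and all function names and synthetic data are assumptions made only for illustration.

```python
import math
import random


def fit_centroid(X, y):
    """Toy member 1: suitability from relative distance to class centroids."""
    def centroid(label):
        pts = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(pts) for col in zip(*pts)]
    c1, c0 = centroid(1), centroid(0)

    def predict(x):
        d1, d0 = math.dist(x, c1), math.dist(x, c0)
        return 0.5 if d1 + d0 == 0 else d0 / (d1 + d0)
    return predict


def fit_knn(X, y, k=3):
    """Toy member 2: fraction of presences among the k nearest points."""
    data = list(zip(X, y))  # snapshot of the current training set

    def predict(x):
        nearest = sorted(data, key=lambda p: math.dist(x, p[0]))[:k]
        return sum(t for _, t in nearest) / len(nearest)
    return predict


def fit_teacher(X, y):
    """Ensemble teacher: average of the member suitability scores."""
    members = [fit_centroid(X, y), fit_knn(X, y)]
    return lambda x: sum(m(x) for m in members) / len(members)


def incremental_pseudo_label(X_lab, y_lab, X_unl, hi=0.8, lo=0.2, max_rounds=10):
    """Grow the training set with confidently pseudo-labeled points."""
    X_train, y_train, pool = list(X_lab), list(y_lab), list(X_unl)
    teacher = fit_teacher(X_train, y_train)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for x in pool:
            score = teacher(x)                    # (i) pseudo-label
            if score >= hi or score <= lo:        # (ii) keep confident ones
                X_train.append(x)
                y_train.append(1 if score >= hi else 0)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:
            break
        teacher = fit_teacher(X_train, y_train)   # (iii) retrain the teacher
    return X_train, y_train, pool


def fit_student(X, y, lr=0.5, epochs=1000):
    """Assumed student: logistic regression fit by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)

    def prob(x):
        return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

    for _ in range(epochs):
        errs = [prob(x) - t for x, t in zip(X, y)]
        grads = [sum(e * x[j] for e, x in zip(errs, X)) / n for j in range(len(w))]
        for j, g in enumerate(grads):
            w[j] -= lr * g
        b -= lr * sum(errs) / n
    return prob


def demo(seed=0):
    """Synthetic two-feature example; 1 = presence, 0 = absence."""
    rng = random.Random(seed)

    def jitter(cx, cy, n):
        return [(cx + rng.uniform(-0.1, 0.1), cy + rng.uniform(-0.1, 0.1))
                for _ in range(n)]

    X_lab = [(0.80, 0.80), (0.75, 0.85), (0.85, 0.78),
             (0.20, 0.20), (0.25, 0.15), (0.15, 0.22)]
    y_lab = [1, 1, 1, 0, 0, 0]
    X_unl = (jitter(0.8, 0.8, 10) + jitter(0.2, 0.2, 10)
             + [(0.50, 0.50), (0.52, 0.47), (0.48, 0.55)])
    X_train, y_train, pool = incremental_pseudo_label(X_lab, y_lab, X_unl)
    return X_train, y_train, pool, fit_student(X_train, y_train)


if __name__ == "__main__":
    X_train, y_train, pool, student = demo()
    print(len(X_train), "training points;", len(pool), "left unlabeled")
    print("suitability at (0.8, 0.8):", round(student((0.8, 0.8)), 3))
```

In this sketch, points whose ensemble score falls between the two thresholds stay in the unlabeled pool and are re-scored after each retraining, which is one plausible reading of "incremental" labeling; the paper's actual selection rule and stopping criterion may differ.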