Leveraging Machine Learning and Process-Based Models for Soil Organic Carbon Prediction: A Comparative Study and the Role of ChatGPT in Soil Science

In recent years, machine learning (ML) algorithms have gained recognition in ecological modeling, including the prediction of soil organic carbon (SOC). However, their application to the smaller datasets typical of long-term soil research has not been extensively evaluated, particularly in comparison with traditional process-based models. A study conducted in Austria compared ML algorithms such as Random Forest and Support Vector Machines (SVMs) against process-based models such as RothC and ICBM, using data from five long-term experimental sites. The findings showed that ML algorithms performed better when large datasets were available, but their accuracy declined with smaller training sets or with more rigorous cross-validation schemes such as leave-one-site-out, in which each site in turn is held out for testing. Process-based models, while requiring careful calibration, offer better insight into the biophysical and biochemical mechanisms underlying SOC dynamics. The study therefore recommended combining ML algorithms with process-based models to leverage their respective strengths for robust SOC predictions across different scales and conditions.
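To make the leave-one-site-out idea concrete, here is a minimal sketch using scikit-learn's LeaveOneGroupOut with a Random Forest regressor. The data are synthetic stand-ins, and the feature meanings, sample sizes, and hyperparameters are assumptions chosen for illustration; none of this reproduces the study's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the study's data): 5 sites x 20 observations.
n_sites, n_per_site = 5, 20
sites = np.repeat(np.arange(n_sites), n_per_site)

# Three hypothetical predictors (for example clay content, temperature, carbon input).
X = rng.normal(size=(n_sites * n_per_site, 3))

# Each site gets its own offset, so an unseen site is genuinely harder to predict.
site_offset = rng.normal(scale=2.0, size=n_sites)[sites]
noise = rng.normal(scale=0.5, size=len(sites))
y = 10 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + site_offset + noise

fold_rmse = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sites):
    held_out = int(sites[test_idx][0])

    # The held-out site must never appear in the training fold.
    assert not np.any(sites[train_idx] == held_out)

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_rmse[held_out] = float(np.sqrt(mean_squared_error(y[test_idx], pred)))

for site, rmse in fold_rmse.items():
    print(f"held-out site {site}: RMSE = {rmse:.2f}")
```

Because every fold tests on a site the model has never seen, the resulting errors are usually larger than those from a random split, which is consistent with the accuracy decline described above.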

SOC is vital for soil health, so maintaining and increasing SOC levels are essential for boosting soil fertility, improving resilience to climate change, and reducing carbon emissions. We need dependable monitoring systems and predictive models to achieve these objectives, especially in light of changing environmental conditions and land-use practices. ML and process-based models both play critical roles in this endeavor. ML is particularly useful with large datasets, while process-based models provide comprehensive insights into soil mechanisms. By combining these approaches, we can mitigate the shortcomings of each and achieve more precise and adaptable predictions, which are crucial for effective soil management and environmental conservation worldwide.

Methods and Materials:

The study used data from five long-term field experiments across Austria, spanning various management practices aimed at SOC accumulation. These experiments covered 53 treatment variants and provided detailed information on soil characteristics, climate data, and management practices. Soil samples were collected from depths within the 0-25 cm range, depending on the site. Daily climate data, including temperature, precipitation, and evaporation, were sourced from high-quality datasets. Process-based SOC models (RothC, AMG.v2, ICBM, and C-TOOL) were employed alongside machine learning algorithms (Random Forest, SVMs, and Gaussian process regression) to predict SOC dynamics.
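For readers unfamiliar with process-based SOC models, the sketch below shows the general idea with a generic two-pool, first-order decay model, loosely following the young/old pool structure used by ICBM. The function name, the Euler integration scheme, and all parameter values are assumptions for illustration only; they are not the calibrated settings or the implementation used in the study.

```python
def simulate_two_pool(years, c_input, k_young, k_old, h, r=1.0,
                      young0=0.0, old0=0.0, dt=0.01):
    """Integrate a two-pool carbon model with explicit Euler steps.

    dY/dt = c_input - k_young * r * Y
    dO/dt = h * k_young * r * Y - k_old * r * O

    Y and O are the young and old carbon pools, h is the fraction of
    decomposed young carbon transferred to the old pool, and r is a
    climate/soil modifier on decomposition. Returns (Y, O) after `years`.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    young, old = young0, old0
    for _ in range(int(round(years / dt))):
        d_young = c_input - k_young * r * young
        d_old = h * k_young * r * young - k_old * r * old
        young += d_young * dt
        old += d_old * dt
    return young, old


# Illustrative parameter values only (assumed, not taken from the study).
young, old = simulate_two_pool(years=30, c_input=0.2, k_young=0.8,
                               k_old=0.006, h=0.125)
print(f"young pool: {young:.3f}, old pool: {old:.3f}, total SOC: {young + old:.3f}")
```

At steady state this model settles at Y = c_input / (k_young * r) and O = h * c_input / (k_old * r), which is one reason such models are easy to interpret: each parameter maps to a physical process rather than to a fitted weight.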

Research Methodology Overview:

The ChatGPT research, conducted between February 25th and March 5th, 2023, evaluated ChatGPT’s ability to answer fundamental questions in modern soil science. Four types of response were assessed: answers from free ChatGPT-3.5, short and long answers from paid ChatGPT-3.5 (Pro-a and Pro-b), and answers from paid ChatGPT-4.0. Each session began with the prompt “Act as a soil scientist,” and if a response timed out, it was followed by “Continue.” For the expert evaluation, five specialists rated the answers on a scale of 0 to 100, and the final scores were averaged. Additionally, a Likert-scale survey on perceptions of ChatGPT’s knowledge and reliability targeted 73 soil scientists and yielded responses from 50 participants for analysis.
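The scoring step described above amounts to a per-model average of five expert ratings. A minimal sketch follows; the ratings are made-up placeholder numbers, not results from the study, and the helper name final_scores is hypothetical.

```python
from statistics import mean

# Placeholder ratings (0-100) from five experts; NOT data from the study.
ratings = {
    "Free ChatGPT-3.5": [60, 55, 70, 65, 50],
    "Pro-a": [62, 58, 71, 66, 53],
    "Pro-b": [68, 60, 75, 70, 57],
    "ChatGPT-4.0": [80, 72, 85, 78, 75],
}


def final_scores(ratings_by_model):
    """Average each model's expert ratings after checking the 0-100 range."""
    scores = {}
    for model, values in ratings_by_model.items():
        if any(v < 0 or v > 100 for v in values):
            raise ValueError(f"rating out of range for {model}")
        scores[model] = mean(values)
    return scores


for model, score in final_scores(ratings).items():
    print(f"{model}: {score:.1f}")
```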

Summary of SOC Sequestration and Modeling Approaches:

The observed annual sequestration rates at the five Austrian sites align with other studies and cover a range of soil and climate conditions typical of Central-Eastern Europe. The study found that certain ML algorithms, such as Random Forest and SVM with a polynomial kernel, outperformed process-based models because of their ability to capture non-linear relationships. Combining ML with process-based models improved predictions. For robust SOC modeling, the recommendation depends on data availability: uncalibrated models when data are scarce, calibrated models with cross-validation when data are adequate, and ML models when data are abundant. Accurate SOC modeling requires comprehensive, long-term datasets encompassing various agricultural practices and conditions.
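The data-availability recommendation can be read as a simple decision rule, sketched below. The summary gives no numeric cutoffs for "scarce", "adequate", or "abundant", so the thresholds are deliberately left as required arguments; the names recommend_strategy, scarce_below, and abundant_from are hypothetical, and the values in the usage lines are arbitrary placeholders.

```python
from enum import Enum


class Strategy(Enum):
    UNCALIBRATED_PROCESS = "uncalibrated process-based model"
    CALIBRATED_PROCESS_CV = "calibrated process-based model with cross-validation"
    MACHINE_LEARNING = "machine learning model"


def recommend_strategy(n_observations, scarce_below, abundant_from):
    """Map dataset size to a modeling strategy.

    The two thresholds must be supplied by the caller; the source text
    does not define them.
    """
    if scarce_below > abundant_from:
        raise ValueError("scarce_below must not exceed abundant_from")
    if n_observations < scarce_below:
        return Strategy.UNCALIBRATED_PROCESS
    if n_observations < abundant_from:
        return Strategy.CALIBRATED_PROCESS_CV
    return Strategy.MACHINE_LEARNING


# Arbitrary placeholder thresholds, for illustration only.
for n in (20, 200, 2000):
    choice = recommend_strategy(n, scarce_below=50, abundant_from=500)
    print(f"{n} observations: {choice.value}")
```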

Perceptions and Contributions of ChatGPT in Soil Science:

A study exploring the perceptions of Indonesian soil scientists towards ChatGPT revealed several notable findings. Of the respondents, 64% were male and 36% female, and the majority (88%) had formal education in soil science. Most respondents (76%) were aware of ChatGPT and 60% had used it, primarily valuing its potential to aid research and academic writing. While 86% did not consider ChatGPT fraudulent, they agreed that its output requires verification and paraphrasing before use in scientific contexts. ChatGPT-4.0 was rated highly for its accuracy in providing relevant answers, particularly in English. Despite their confidence in ChatGPT’s potential to advance soil science, the respondents emphasized the need for human oversight to ensure the tool’s responsible and effective use.


Conclusions on the Use of ChatGPT in Soil Science and Machine Learning for SOC Prediction:

The research highlights the valuable role of ChatGPT and ML in soil science. Indonesian soil scientists express over 80% trust in ChatGPT, favoring ChatGPT-4.0 for its superior accuracy in aiding research and education, though the free and paid versions of ChatGPT-3.5 are also considered reliable. However, the perceived accuracy of ChatGPT responses is generally 55%, indicating room for improvement. Concurrently, non-linear ML models such as Random Forest, especially when combined with process-based models, show promise in predicting SOC dynamics, particularly in datasets from long-term agricultural studies. Integrating ML with expert knowledge could enhance the precision of SOC forecasts, underlining the importance of human oversight and model refinement.