The production capacity of shale oil reservoirs after hydraulic fracturing is influenced by a complex interplay involving geological characteristics, engineering quality, and well conditions. These relationships are nonlinear and are therefore difficult to describe accurately with physical models. While field data provides insight into real-world effects, its limited volume and quality restrict its utility, and numerical simulation models offer an effective complement. To harness the strengths of both data-driven and model-driven approaches, this study establishes a shale oil production capacity prediction model based on a machine learning combination model. Leveraging fracturing development data from 236 wells in the field, a data-driven method employing the random forest algorithm is implemented to identify the main controlling factors for different types of shale oil reservoirs. Through a combination model integrating the support vector machine (SVM) algorithm and the back propagation neural network (BPNN), a model-driven shale oil production capacity prediction model is developed, capable of swiftly responding to shale oil development performance under varying geological, fluid, and well conditions. The results of numerical experiments show that the proposed method improves R^{2} by 22.5% and 5.8% compared to single machine learning models such as SVM and BPNN, demonstrating its superior precision in predicting shale oil production capacity across diverse datasets.

The prediction and evaluation of horizontal well production capacity is the key to the rational and effective development of shale oil reservoirs [

In the evaluation process of shale oil production capacity, understanding the extent of influence and interrelationships among various factors is crucial [

Machine learning (ML) excels in extracting information from high-dimensional, complex data, playing a significant role in predicting reservoir production capacity automatically in supervised or unsupervised modes [

This study utilizes the random forest method to assess the main controlling factors influencing shale oil production capacity across various reservoirs. It leverages drilling, fracturing, and well-testing data from 236 horizontal wells subjected to volume fracturing in the R1 and R2 reservoirs. Additionally, it constructs a sample set for predicting production capacity through numerical simulation models of shale oil reservoir seepage. Employing a machine learning combination model with support vector machine and BP neural network as base learners, the study establishes a prediction model specifically targeting accurate estimates of production capacity for fractured horizontal wells.

Machine learning algorithms, particularly Random Forest (RF), have increasingly been applied to evaluate the main controlling factors of shale oil production capacity. This algorithm, a key representative of bagging ensemble methods, efficiently handles typical datasets in shale oil production capacity analysis, offering extensive application in feature screening and prediction.

The dataset in question, designated as X = {X_{1}, X_{2}, ..., X_{N}}, where each X_{i} = {x_{i,1}, x_{i,2}, ..., x_{i,M}}, comprises N instances, each constituted by M variables. The output dataset is denoted as Y = {Y_{1},Y_{2}, ...,Y_{N}}. Given a partition variable

where _{s}. The CART algorithm endeavors to discern the optimal partition feature and its value that bifurcates D_{s} into two subsets, aiming to minimize the loss function. Upon completion of CART training, the amalgamation of diverse CARTs constitutes the RF model [

Upon finalization of the RF model, it becomes feasible to calculate the significance or influence of each input variable. During the individual CART training, N instances are randomly selected with replacement from the dataset, leaving a subset of unchosen instances known as Out of Bag (OOB) samples. Following training, these OOB samples serve to evaluate the impact of input variables on the RF model. Assuming the RF model incorporates T CARTs, and defining the OOB baseline error for the t-th CART as

Aggregating the decrements in accuracy for the j-th variable across all CARTs yields the mean decrement:

This mean is normalized to represent the variable’s contribution to the model’s performance in predicting production capacity, formulated as:

where

Through the computation of the Gini coefficient for each decision tree node, RF efficiently assesses the relative significance of different factors, facilitating informed decision-making processes. Its ability to mitigate overfitting and accommodate diverse data types makes it an ideal tool for identifying the main controlling factors influencing shale oil production capacity.
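The importance calculation described above (baseline error on held-out samples, error after perturbing one variable, then normalization) can be illustrated with a minimal sketch. It is not the authors' implementation: it evaluates a single stand-in predictor rather than averaging over the T trees of a random forest, and the names `permutation_importance`, `toy_model`, `X_oob`, and `y_oob` are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, seed=0):
    """Normalized error increase when each input column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    increases = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between variable j and the target
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        err = mse(y, [predict(row) for row in X_perm])
        increases.append(max(err - baseline, 0.0))
    total = sum(increases)
    return [inc / total if total > 0 else 0.0 for inc in increases]


def toy_model(row):
    """Stand-in for a trained model (assumption): ignores the third input."""
    return 3.0 * row[0] + 0.5 * row[1]


rng = random.Random(1)
X_oob = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
y_oob = [toy_model(row) for row in X_oob]
weights = permutation_importance(toy_model, X_oob, y_oob)
print(weights)
```

In this toy setting the third variable receives zero weight because the stand-in model never uses it, and the first variable dominates because its coefficient is largest.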

Support Vector Machine (SVM) is a generalized linear classifier and regressor [

In regression analysis, after mapping samples into a higher-dimensional space, the task is to identify a hyperplane that minimizes the distance to the sample points furthest from it. Sample outcomes are then predicted from their position relative to this hyperplane. SVM employs the Radial Basis Function (RBF) kernel to map samples into higher dimensions, following the formula:

where

During model training, the model uses a loss function with L2 regularization (which reduces overfitting):

where

The Back Propagation Neural Network (BPNN) stands as one of the fundamental models in machine learning [

The BPNN model is trained on multiple sets of data samples, adjusting the connection weights of individual neurons according to the discrepancy between actual and expected output values. This iterative process continues until the target parameters are predicted with sufficient accuracy. In the context of production capacity forecasting, inputting various feature parameters from individual wells, alongside specified parameters such as the number of network layers, the number of neurons in the hidden layers, the learning rate, and the iteration count, yields the predicted production capacity output. The topological structure of the BPNN is depicted in

In theory, with sufficiently many hidden layers and neurons and an adequate amount of training data, a BPNN can approximate any continuous function. The fundamental process of training a BPNN involves:

① Forward propagation from the input layer, where node outputs are computed, based on initialized parameters (random initialization) and activation functions, progressing until the output layer, subsequently calculating the error.

② Employing an optimizer to refine the error by utilizing backpropagation via the chain rule for derivative calculation, thereby updating weights and biases until meeting termination conditions.
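The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the network used in the study: it has one hidden layer with a tanh activation (the study reports three hidden layers with ReLU), uses plain stochastic gradient descent as the optimizer, and the function name `train_bpnn` and the toy target are hypothetical.

```python
import math
import random


def train_bpnn(samples, hidden=4, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer network; return the per-epoch running MSE."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Random initialization of weights; biases start at zero.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in samples:
            # Step 1: forward propagation from input to output, then the error.
            h = [math.tanh(sum(w * xi for w, xi in zip(W1[k], x)) + b1[k])
                 for k in range(hidden)]
            y_hat = sum(W2[k] * h[k] for k in range(hidden)) + b2
            err = y_hat - y
            total += err * err
            # Step 2: backpropagation via the chain rule for loss 0.5 * err^2.
            for k in range(hidden):
                grad_h = err * W2[k] * (1.0 - h[k] ** 2)  # uses W2 before its update
                W2[k] -= lr * err * h[k]
                for i in range(n_in):
                    W1[k][i] -= lr * grad_h * x[i]
                b1[k] -= lr * grad_h
            b2 -= lr * err
        history.append(total / len(samples))
    return history


# Toy nonlinear target y = x0 * x1 on a 5 x 5 grid (illustrative data only).
samples = [([a / 4, b / 4], (a / 4) * (b / 4)) for a in range(5) for b in range(5)]
loss_history = train_bpnn(samples)
print(loss_history[0], loss_history[-1])
```

The termination condition here is simply a fixed number of epochs; an error threshold or early stopping on a held-out set would be equally valid choices.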

Because various machine learning models operate on different principles, their efficacy in uncovering latent information within data also varies. These models are not mutually exclusive but rather complementary and can be interconnected. Relying solely on a single predictive model to determine hyperparameters for different datasets might introduce significant biases in prediction errors. Dismissing the use of the model could result in losing valuable insights it could have extracted, potentially leading to larger errors. Hence, ensemble models represent one effective approach to enhancing prediction accuracy. One critical issue in the machine learning combination model is how to construct it, and common methods for constructing the model include concatenated and parallel combination models.

(1) Concatenated combination model

This model involves utilizing the output of one model as the input for the next model. Each layer in the model can process and transform input data, extracting more useful features. The basic framework of the model is illustrated in

(2) Parallel combination model

In this model, predictions from different models are combined through weighted averaging, where appropriate weights are assigned to each model’s prediction before merging. This approach yields more precise predictions, effectively handling generalization errors and overfitting. Simultaneously, it maintains higher accuracy and reliability. The basic framework of the model is illustrated in

Compared to the concatenated combination model, the parallel combination model exhibits higher stability, operational efficiency, and stronger scalability. Consequently, for this study, a parallel combination model is chosen to establish the shale oil production capacity prediction model.

Furthermore, the final predictive performance of combination models closely correlates with weight determination. Different methods for calculating weights vary in difficulty and effectiveness. Common methods for weight determination include equal-weight averaging, error variance weighted averaging, and inverse relative error methods. To fully showcase the advantages of various machine learning algorithms in handling complex shale oil production capacity influencing factors, this study adopts the inverse relative error method to determine weight coefficients. The principle behind this method lies in utilizing relative errors to assess the predictive accuracy of models. Larger relative errors imply lower significance within the ensemble model, thus receiving lower weights. Conversely, smaller relative errors indicate better predictive performance, thereby obtaining higher weights within the ensemble predictive model. The formula for this method is:

where

When establishing a mathematical model for the three-phase flow of oil, gas, and water within shale oil reservoirs, several fundamental assumptions are typically made owing to the low density of shale oil, its tendency to contain dissolved gas, and the potential involvement of formation water in flow. These assumptions generally include: (1) The flow within the shale oil reservoir is isothermal; (2) Flow considerations for the oil and water phases account for threshold pressure gradients and stress sensitivity, while gas phase flow considers stress sensitivity only; (3) Hydrocarbons within the shale oil reservoir consist solely of oil and gas components, with the oil component exclusively present within the oil phase and the gas component able to exist in both the gas and oil phases; (4) Oil and gas are immiscible with water.

The permeability of shale oil reservoirs is extremely low, often in the nanodarcy range, showcasing distinct nonlinear characteristics in flow. The flow equation that incorporates threshold pressure gradients is given by:

where

Additionally, during the production of shale oil, the decline in reservoir pressure leads to an increase in effective stress on the rock, consequently causing a reduction in reservoir permeability. The relationship between shale permeability and effective stress can be expressed as:

where

where

Furthermore, to close the system of equations, the following auxiliary equations are required:

where

This study included 225 wells from the R1 reservoir and 11 wells from the R2 reservoir. The field data associated with these wells is used to establish a data-driven approach for screening the main controlling factors affecting shale oil production capacity.

To precisely evaluate well performance and determine potential production, the analysis is conducted across two dimensions: reservoir properties and well parameters. Reservoir properties such as porosity, oil saturation, permeability, density, and clay content are considered because they directly affect the storage and flow of hydrocarbons. Well parameters encompass specific configurations and operational details such as the length of the horizontal section and the oil layer segment length. With practical production cycles in mind, a three-month production period is selected as the benchmark for production capacity assessment, reflecting initial production conditions and providing ample data support.

A detailed data analysis, summarized in the table below, shows that the reservoir density of R1 ranges from 2.38 to 2.61 g/cm^{3} with an average of 2.50 g/cm^{3}, while R2’s is slightly higher, ranging from 2.52 to 2.66 g/cm^{3} with an average of 2.55 g/cm^{3}. Additionally, variations are observed in clay content, permeability, porosity, and oil saturation between the two reservoirs. Overall, R1 slightly outperforms R2 in static reservoir evaluation parameters.

Evaluation dimension | Evaluation parameter | R1 reservoir, range (average) | R2 reservoir, range (average) |
---|---|---|---|
Reservoir properties | Reservoir density (g/cm^{3}) | 2.38–2.61 (2.50) | 2.52–2.66 (2.55) |
 | Clay content (%) | 11.2–49.4 (19.0) | 14.3–35.5 (19.10) |
 | Permeability (mD) | 0.0006–0.0047 (0.0016) | 0.0075–0.0447 (0.017) |
 | Porosity (%) | 2.25–5.44 (4.12) | 2.91–6.08 (4.18) |
 | Oil saturation (%) | 27.66–63.52 (49.84) | 28.40–58.44 (50.30) |
Well parameters | Length of horizontal section (m) | 184–3035 (1239) | 120–4035 (1291) |
 | Oil layer segment length (m) | 140–2528 (1031) | 103–3223 (1060) |
Production capacity (t/d) | | 22.5–192.6 (112.5) | 55.8–118.8 (83.7) |

In well parameters, R1’s horizontal section length ranges from 184 to 3035 m, averaging 1239 m, compared to R2’s 120 to 4035 m, with an average of 1291 m. Differences in the oil layer segment length are also evident.

Crucially, in terms of production capacity, R1 ranges from 22.5 to 192.6 t/d, with an average of 112.5 t/d, significantly higher than R2’s 55.8 to 118.8 t/d, averaging 83.7 t/d. In summary, R1 exhibits a marginally superior performance across multiple evaluation dimensions, particularly in production capacity, providing essential data for further development of the shale oil field.

The factors influencing the horizontal well fracturing for shale oil can be broadly categorized into two types: A and B. Type A factors primarily consist of static parameters that determine the maximum potential production capacity in an ideal state. These include the size of the seepage area, the initial energy of the formation, and the formation’s seepage capability. In theory, type A represents the maximum potential production capacity under ideal conditions, assuming optimal reservoir working conditions without any non-physical interference.

Type B factors include dynamic and static parameters that reflect the depletion process of formation energy. They involve critical elements such as the elastic energy of the oil and the production rate. These core parameters reflect reservoir information and production dynamics, essential in the prediction process of early production capacity following shale oil horizontal well fracturing.

Given that type A encompasses numerous parameters, applying redundant parameters in production capacity prediction can degrade the predictive outcome. Therefore, it is crucial to select the main controlling factors from type A to ensure the accuracy of prediction results. In contrast, type B factors, reflecting essential reservoir and production dynamics, are indispensable in the early production capacity prediction process and do not require the same level of screening as type A.

Based on the field data of each well, this study uses an RF algorithm to screen the main controlling factors of the production capacity of different typical shale reservoirs. It can identify the factors that have the greatest impact on the shale oil production capacity under different geological, fluid, and well conditions and realize the data-driven main controlling factor screening. RF is an ensemble learning method mainly used for classification, regression, and feature selection, which has been widely adopted in petroleum engineering because of its robustness, accuracy, and ease of use.

Comprising numerous decision trees, each trained on a randomly selected subset of the data, RF finalizes its predictions through a voting or averaging process of all individual tree outcomes. This methodology excels at managing vast quantities of input variables and evaluating the significance of each, proving invaluable in multifactorial analyses where feature selection is crucial.

Based on the above methods, the main controlling factors affecting the fracturing development of different typical shale oil reservoirs are determined. In the R1 reservoir, the production capacity is predominantly governed by geological parameters, with the oil saturation, permeability, length of horizontal section, and reservoir density exerting a significant influence on production capacity. The detailed influence weights for the R1 reservoir are shown in

Conversely, the R2 reservoir exhibits inferior overall physical properties, leading to initial production capacity that is more substantially impacted by well parameters. Factors such as the oil layer segment length, length of horizontal section, permeability, and oil saturation are observed to have a considerable effect on production capacity levels. The detailed influence weights for the R2 reservoir are shown in

This study employs the reservoir numerical simulation method to construct a sample set for production capacity prediction, thereby establishing a model-driven approach for predicting shale oil production capacity. Focusing on the R1 and R2 reservoirs, a numerical simulation of horizontal well fracturing development considering threshold pressure gradients and stress sensitivity is carried out. Utilizing the E100 module of the Eclipse reservoir simulation software, a grid system for the foundational model is meticulously established. The model adheres to the Black Oil model, employing a block-centered grid within a Cartesian coordinate system. The grid system of the model comprises dimensions of 219 × 73 × 20, with grid steps of 20 m in both the X and Y directions, and a 1 m step in the Z direction.

The specific attributes of the model include a vertical-to-horizontal permeability ratio of 0.1, with additional detailed parameters presented in

Parameter name | Parameter value | Parameter name | Parameter value |
---|---|---|---|
Oil saturation (%) | 50 | Number of fractures | 36 |
Comprehensive compressibility (1/MPa) | 0.0067 | Fracture width (m) | 0.006 |
Fracture half-length (m) | 110 | Fracture conductivity (mD·m) | 300 |
Matrix permeability (mD) | 0.0011 | Matrix porosity (%) | 5 |
Oil volume factor | 1.11 | Oil viscosity (mPa·s) | 1.89 |

The 3D distribution of oil saturation for the shale oil reservoirs is depicted in

Then, an automated invocation program for the numerical simulator is written to build the production capacity prediction sample set efficiently. The framework of the program is shown in

In this study, the automated invocation program for numerical simulators is used to simulate the production dynamics of the R1 and R2 reservoirs 800 and 300 times, respectively. Subsequently, data normalization procedures are applied to ensure the compatibility of the production capacity sample set of typical shale oil reservoirs with various prediction models, including the SVM algorithm, BPNN, and their combination model.

Due to the disparate dimensions and significant magnitude differences among the aforementioned input parameters, the complexity of model construction is increased, potentially leading to a decrease in accuracy. To circumvent numerical and dimensional issues arising from variations between different parameters, data preprocessing is commonly employed. In this study, the Z-Score normalization method is utilized to process the input parameters. Models built with standardized data exhibit faster execution speeds and yield superior predictive performance. The formula for the Z-Score normalization is given below:

where

After completing the above operations, the production capacity prediction sample set has been established and can be used in the training process of the model.
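As an illustration of the Z-Score step, the sketch below standardizes each input column as z = (x - mean) / standard deviation. It assumes the population standard deviation, which the text does not specify, and the function name `zscore_columns` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) std."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    scaled = [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, mus, sigmas)]
        for row in rows
    ]
    return scaled, mus, sigmas


# Illustrative rows: [permeability (mD), horizontal section length (m)].
raw = [[0.0006, 184.0], [0.0016, 1239.0], [0.0047, 3035.0]]
scaled, mus, sigmas = zscore_columns(raw)
print(scaled)
```

The returned means and standard deviations should be computed on the training set only and reused to transform the testing and validation sets, so that no information leaks from held-out samples.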

This study aims to establish a production capacity prediction model using machine learning methods in typical shale oil reservoirs. The model takes the main controlling factors of production capacity after hydraulic fracturing as input data and the average daily oil production over the first three months of production as output data. The objective is to enable a comprehensive, multifactorial prediction of shale oil production capacity under various geological, fluid, and hydraulic fracturing development conditions, facilitating a swift response to the potential of shale oil hydraulic fracturing development.

According to the screening results of the main controlling factors, oil saturation, permeability, length of the horizontal section, and reservoir density are the main controlling factors affecting the production capacity of the R1 reservoir. The oil layer segment length, length of the horizontal section, permeability, and oil saturation are the main factors affecting the production capacity of the R2 reservoir. These main controlling factors serve as the type A input parameters of the R1 and R2 production capacity prediction models. Meanwhile, type B encompasses production rate, reservoir pressure, and the gas-oil ratio of dissolved gas, adding three more factors to be incorporated into the prediction model. Consequently, the R1 and R2 production capacity prediction models each take seven input factors. These factors, together with the outputs (shale oil production capacity), constitute the shale oil production capacity prediction sample set.

To conduct a training process on the production capacity prediction sample set for the shale oil reservoir, it is necessary to partition the samples. They are divided into training, testing, and validation sets in certain proportions, serving the purposes of training the shale oil production capacity prediction model, optimizing various machine learning algorithm hyperparameters, and verifying the model’s training outcomes. In this study, 800 numerical simulation runs are generated for the R1 reservoir by randomly matching different geological, fluid, and hydraulic fracturing development parameters. Specifically, 640 models (80% of the sample set) are utilized to train the shale oil production capacity prediction model based on the loss function, 80 models (10% of the sample set) are used to adjust different machine learning algorithm hyperparameters, and the final 80 models (10% of the sample set) are employed to validate the predictive performance of the model. Additionally, to showcase the impact of sample size on various machine learning methods, 300 numerical simulation runs are created for the R2 reservoir. These are split into 240 models for training, 30 for testing, and 30 for validation purposes.
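The 80/10/10 partition described above can be reproduced with a short index-shuffling sketch. The function name `split_samples` and the fixed random seed are assumptions made for illustration; the text does not state how the random assignment was performed.

```python
import random


def split_samples(n_samples, train_frac=0.8, test_frac=0.1, seed=0):
    """Shuffle sample indices and split them into training/testing/validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_frac)
    n_test = round(n_samples * test_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]        # hyperparameter tuning
    validation = indices[n_train + n_test:]         # final performance check
    return train, test, validation


r1_train, r1_test, r1_val = split_samples(800)  # R1: 800 simulation runs
r2_train, r2_test, r2_val = split_samples(300)  # R2: 300 simulation runs
print(len(r1_train), len(r1_test), len(r1_val))
print(len(r2_train), len(r2_test), len(r2_val))
```

Note that the naming follows the text: the testing set is used for hyperparameter adjustment and the validation set for the final check, which is the reverse of the convention in many machine learning libraries.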

In tackling the regression problem with multiple input parameters, this study employs two distinct mainstream machine learning methods and their combination to construct prediction models: SVM and BPNN. Given that the objective of this study is to establish a shale oil production capacity prediction model applicable to scenarios with varying sizes of field data, it is noteworthy that BPNN, known for its capability in handling high-dimensional and large-scale datasets, can capture complex nonlinear relationships. SVM, on the other hand, demonstrates clear superiority in handling small sample datasets compared to other ML methods. Therefore, for the shale oil production capacity prediction problem addressed in this study, a parallel combination model integrating SVM and BPNN is chosen to leverage their respective strengths.

The primary step involves determining the pertinent parameters of these diverse machine-learning methods to suit the sample set. These encompass standard parameters (like weights and biases) and hyperparameters (such as neural network layers). Standard parameters are typically resolved through routine learning and training, while hyperparameters are usually optimized using manual or grid search methods to compare the performance of network models on the validation set under various parameter combinations, thereby facilitating optimal selection. However, these conventional methods often encounter difficulties such as slow convergence and susceptibility to local optima. The Particle Swarm Optimization (PSO) algorithm exhibits robust self-organizing learning capabilities, strong global search prowess, swift convergence, and ease of parameter implementation. Consequently, in constructing a prediction model, the PSO algorithm is employed to optimize the hyperparameters of SVM and BPNN.
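To make the PSO search concrete, the following is a minimal sketch of a standard global-best particle swarm minimizing a generic objective over box bounds. It is not the authors' configuration: the inertia weight, acceleration coefficients, swarm size, and the quadratic `toy_objective` (standing in for a validation error evaluated at two hyperparameters) are all assumptions.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iter=60,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO; returns the best position found and its objective value."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_objective(p):
    """Hypothetical stand-in for a validation error over two hyperparameters."""
    return (p[0] - 0.5) ** 2 + (p[1] + 1.0) ** 2


best, best_val = pso_minimize(toy_objective, [(-2.0, 2.0), (-3.0, 3.0)])
print(best, best_val)
```

In practice the objective would train the SVM or BPNN with the candidate hyperparameters and return its error on the tuning set, which makes each evaluation far more expensive than in this toy example.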

Furthermore, to assess the accuracy of the selected machine learning methods in predicting production capacity and their model generalization, this study employs the determination coefficient (R^{2}), mean squared error (MSE), and mean absolute error (MAE) to evaluate the predictive performance of the shale oil production capacity model. The calculation process for each is depicted respectively in

where

Forecasting production capacity is crucial for assessing new well productivity and evaluating economic returns. Establishing a shale oil production capacity prediction model based on geological, fluid, and hydraulic fracturing development parameters enables approximating well productivity using production decline patterns. This approach aids in promptly adjusting development strategies and production strategies for wells, optimizing their performance and operational efficacy.

Based on the geological and engineering factors that influence production capacity, this study employs the SVM algorithm, BPNN, and their combination model to predict the production capacity of typical shale oil reservoirs. The R^{2}, MSE, and MAE are used to evaluate the performance of different models. By connecting the SVM algorithm and BPNN in parallel, a combination prediction model is constructed. The performance of individual prediction models and the combination prediction model in different typical shale oil reservoirs is then compared and analyzed.
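For reference, the three evaluation metrics can be computed with their standard definitions as in the sketch below. The function name `regression_metrics` and the sample values are illustrative and are not taken from the study's data.

```python
def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, and MAE using their standard definitions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)             # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": ss_res / n,
        "mae": sum(abs(a - b) for a, b in zip(y_true, y_pred)) / n,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(metrics)
```

R^{2} close to 1 indicates that the model explains most of the variance in the observed values, while MSE penalizes large individual errors more heavily than MAE.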

First, a shale oil production capacity prediction model based on the SVM algorithm is established, followed by hyperparameter optimization using the testing set. A crucial aspect of SVM regression is the selection of the kernel function type, which includes linear, polynomial, sigmoid, and Gaussian RBF kernel functions. Among these, the RBF kernel function is widely applied. While a linear kernel can reduce the computational cost when the sample set is high-dimensional and contains a sufficient quantity of samples, the RBF kernel function generally performs well irrespective of sample size or data dimensionality. Another critical hyperparameter in SVM is the penalty factor C, which determines the degree of loss for outliers: smaller C values correspond to smaller losses in the objective function. In this study, the SVM hyperparameters resulting from PSO optimization are an RBF kernel function with C = 3. Following modeling, the SVM-based shale oil production capacity prediction model is validated on the R1 and R2 reservoirs. Comparisons between predicted and actual results on the training and validation sets give R^{2} values of 0.74 and 0.82 for the R1 and R2 reservoirs, respectively.

The inclusion of a regularization term in SVM helps mitigate overfitting caused by substantial randomness in shale oil development data, as well as complex main controlling factors affecting production, contributing to the model’s robust generalization ability. Particularly, when dealing with numerous well parameters with weak interdependencies, SVM demonstrates good predictive performance without further feature selection, especially in scenarios with high-dimensional features and limited sample size. In this study, compared to the R1 reservoir, SVM exhibits better performance when applied to the R2 reservoir. This is attributed to SVM’s superior performance in handling small sample datasets, allowing for smooth training on the 240 samples (training set for R2 reservoir) available in this dataset. Consequently, SVM becomes suitable for production capacity prediction in situations involving a multitude of features and poor parameter independence.
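The RBF kernel selected for the SVM model can be illustrated with a small sketch. It assumes the common parameterization K(x, x') = exp(-gamma * ||x - x'||^2); the source's own kernel formula is not recoverable from the text, so `gamma` and the function names are assumptions.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(samples, gamma=1.0):
    """Pairwise kernel matrix that an SVM solver works with."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]


points = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]  # illustrative standardized inputs
K = gram_matrix(points, gamma=0.04)
print(K[0][1])  # equals exp(-0.04 * 25) = exp(-1)
```

Identical samples have kernel value 1 and the value decays toward 0 with distance, which is why standardizing the inputs beforehand matters: otherwise the variable with the largest magnitude dominates the distance.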

Subsequently, a shale oil production capacity prediction model is developed based on the BPNN, employing a testing set to determine the network’s hyperparameters such as the number of hidden layers and neurons, further optimized using the PSO algorithm. The optimal hyperparameter combination yielded a model with three hidden layers and ten neurons. To fit nonlinear functions, the model employs the ReLU activation function during propagation from the input to the hidden layers, enabling it to capture nonlinear patterns. Following model construction, the BPNN-based shale oil production capacity prediction model is validated on the R1 and R2 reservoirs. Comparisons between predicted and actual results on the training and validation sets give R^{2} values of 0.86 and 0.77 for the R1 and R2 reservoirs, respectively.

On the R1 reservoir, the abundance of data used for training the BPNN allows for a robust fitting of nonlinear relationships between factors and production, resulting in higher prediction accuracy compared to the SVM algorithm. This makes it more suitable for production capacity prediction in this oil field. Essentially, the BPNN inherently fits features to prediction targets, providing superior descriptions of their nonlinear relationships compared to non-neural network models. Consequently, it performs better in production capacity prediction scenarios characterized by complex factor-production relationships. However, on the R2 reservoir, where the dataset contains fewer samples, the BPNN cannot be sufficiently trained, and the SVM algorithm, which is adept at handling multi-feature, small-sample data, outperforms it. Therefore, in shale oil production capacity prediction, SVM excels in handling on-site data with a scarcity of samples, especially in the presence of numerous missing or outlier values, while still achieving the desired predictive outcomes. BPNN, on the other hand, is more suitable when an ample amount of data is available, demonstrating superior predictive performance compared to traditional ML methods. Consequently, in addressing predictive problems involving the varying data volumes of multiple shale oil reservoirs, it is necessary to combine the complementary strengths of both algorithms.

Finally, integrating SVM and BPNN as base learners forms a machine-learning combination model. The hyperparameter settings for the combination model align with the optimization results mentioned earlier. In this study, a weighted averaging method is employed to integrate different base learners into the combination model, as depicted in

The resulting combination model attains R^{2} values of 0.91 and 0.88 on the R1 and R2 reservoirs, respectively.
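The inverse relative error weighting used for the parallel combination can be sketched as follows. The source's weight formula is not recoverable from the text, so this assumes the common form in which each base learner's weight is proportional to the reciprocal of its mean relative error; the function names and the numbers are hypothetical.

```python
def mean_relative_error(y_true, y_pred):
    """Average of |prediction - truth| / |truth| over all samples."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def inverse_error_weights(rel_errors):
    """Weights proportional to 1 / relative error, normalized to sum to 1."""
    inverse = [1.0 / e for e in rel_errors]
    total = sum(inverse)
    return [v / total for v in inverse]


def combine(predictions_by_model, weights):
    """Weighted average of the base learners' predictions, sample by sample."""
    return [sum(w * p for w, p in zip(weights, preds))
            for preds in zip(*predictions_by_model)]


# Illustrative values only (t/d); not results from the study.
y_true = [100.0, 80.0]
svm_pred = [110.0, 76.0]
bpnn_pred = [85.0, 92.0]
errors = [mean_relative_error(y_true, svm_pred),
          mean_relative_error(y_true, bpnn_pred)]
weights = inverse_error_weights(errors)
combined = combine([svm_pred, bpnn_pred], weights)
print(weights, combined)
```

Here the learner with half the relative error receives twice the weight. The weights should be derived from errors on held-out samples rather than training samples, otherwise an overfitted base learner is rewarded.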

Additionally, the performance of different models on the production capacity prediction sample set is detailed in

Metric | SVM (R1) | SVM (R2) | BPNN (R1) | BPNN (R2) | Combination model (R1) | Combination model (R2)
---|---|---|---|---|---|---
R^{2} | 0.74 | 0.82 | 0.86 | 0.77 | 0.91 | 0.88
MSE | 3.59 | 2.88 | 2.35 | 3.30 | 1.88 | 2.12
MAE | 1.49 | 1.21 | 0.80 | 1.31 | 0.54 | 0.77
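For reference, the three evaluation metrics reported above can be computed as in the following minimal sketch. It uses the standard textbook definitions of R^{2}, MSE, and MAE, which are assumed here to match the paper's. The sample values are hypothetical and unrelated to the R1 and R2 datasets.

```python
def mse(actual, predicted):
    """Mean squared error over the validation samples."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error over the validation samples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical validation values, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

scores = {
    "R2": r2(actual, predicted),
    "MSE": mse(actual, predicted),
    "MAE": mae(actual, predicted),
}
```

A higher R^{2} and lower MSE and MAE indicate a better fit, which is the sense in which the combination model outperforms both base learners in the table.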

This paper addresses the challenge of swiftly and accurately assessing the effects of hydraulic fracturing development in shale oil wells. Combining the advantages of data-driven and model-driven methodologies, it establishes a combined machine learning model for predicting shale oil productivity. Using hydraulic fracturing development data from 236 wells, a data-driven method based on the random forest algorithm identifies the main controlling factors for different types of shale oil reservoirs. A model-driven productivity prediction model is then developed by integrating a support vector machine (SVM) and a back propagation neural network (BPNN). This model responds rapidly to the dynamic development performance of shale oil under geological and engineering uncertainties.

1. The data-driven method based on the random forest algorithm is used to screen the main controlling factors for shale oil reservoir production capacity. In the R1 reservoir, production capacity is primarily influenced by geological parameters such as oil saturation, permeability, length of horizontal section, and reservoir density. In the R2 reservoir, it is predominantly influenced by well parameters such as oil layer segment length, length of horizontal section, permeability, and oil saturation.

2. The combination model, which uses SVM and BPNN as base learners, provides a model-driven prediction model for shale oil production capacity. This model responds swiftly to changes in shale oil production capacity under diverse geological, fluid, and hydraulic fracturing development conditions.

3. The combination model performs better across the datasets than single machine learning methods such as SVM and BPNN. SVM performs well on small datasets with numerous input features, as in the R2 reservoir. Conversely, BPNN excels at revealing nonlinear relationships among the influencing factors when ample samples are available, as in the R1 reservoir. By integrating the strengths of both base learners, the combination model outperforms each of them on both reservoirs, reaching R^{2} values of 0.91 (R1) and 0.88 (R2). This indicates that the combination model offers greater reliability and practicality in predicting shale oil production dynamics, establishing its utility in the petroleum engineering domain.

4. The method proposed in this paper performs well on datasets with varying sample sizes, effectively addressing the prediction challenges that arise from differences in data volume among wells at oilfield sites. However, the method also has limitations. Because the base learners are basic machine learning techniques, they cannot exploit complex high-dimensional data containing additional reservoir information, such as permeability fields, saturation fields, and temporal variations in production data. Future research will focus on addressing these shortcomings by integrating methods such as deep learning and reinforcement learning to develop production capacity prediction approaches suitable for a broader range of reservoir development scenarios.

Partition variable

Corresponding threshold

Corresponding loss function

Mean of y for cases where the j-th variable does not exceed the threshold

Mean of y for cases where the j-th variable exceeds the threshold

OOB baseline error for the t-th CART

Resultant OOB error

Sum of all the values of the decrease of the accuracy

Bandwidth of the kernel

Reciprocal of the influence radius

Penalty factor

Relative error of the i-th single machine learning model

Flow velocity

Relative permeability

Permeability

Viscosity

Threshold pressure gradient

Potential gradient

Initial permeability

Initial reservoir pressure

Stress sensitivity coefficient

Flow velocities of the oil, gas, and water phases

Densities of oil and dissolved gas components within the oil phase

Densities of the gas and water phases

Source or sink terms for the oil, gas, and water components

Porosity

Saturation levels of the oil, gas, and water phases

Pressures of the oil, gas, and water phases

Capillary forces between oil-water and oil-gas

Standardized data

Original data

Standard deviation of the data

Mean of the data

Actual and predicted value

Mean of actual values

Size of actual values within the validation set

Predictions of the base learners

Weights attributed to different base learners

None.

This work was supported by the China Postdoctoral Science Foundation (2021M702304) and Natural Science Foundation of Shandong Province (ZR20210E260).

The authors confirm contribution to the paper as follows: Study conception and design: Qin Qian, Yuliang Su; data collection: Mingjing Lu; analysis and interpretation of results: Anhai Zhong, Wenjun He; draft manuscript preparation: Feng Yang, Min Li. All authors reviewed the results and approved the final version of the manuscript.

The data that has been used is confidential.

The authors declare that they have no conflicts of interest to report regarding the present study.
