Industry 4.0 production environments and smart manufacturing systems integrate the physical and decision-making aspects of manufacturing operations into autonomous, decentralized systems. One of the key aspects of these systems is production planning, specifically scheduling operations on machines. To cope with this problem, this paper proposes Deep Reinforcement Learning with an Actor-Critic algorithm (DRLAC). We model the Job-Shop Scheduling Problem (JSSP) as a Markov Decision Process (MDP), represent the state of a JSSP with a simple Graph Isomorphism Network (GIN) to extract node features during scheduling, and derive the optimal scheduling policy, which maps the extracted node features to the best next scheduling action. In addition, we adopt the Actor-Critic (AC) training algorithm from reinforcement learning to obtain the optimal scheduling policy. To demonstrate the proposed model's effectiveness, we first present a case study that illustrates a conflict between two job schedules; second, we apply the proposed model to a known benchmark dataset and compare the results with traditional scheduling methods and current approaches. The numerical results indicate that the proposed model can adapt to real-time production scheduling: the average percentage deviation (APD) of our model ranged from 0.009 to 0.21 against heuristic methods and from 0.014 to 0.18 against other current approaches.
Along with the fourth industrial revolution, a Smart Factory [
To solve JSSPs of different scales, researchers have proposed several efficient algorithms that improve various aspects of the JSSP. Exact methods based on integer programming formulations were introduced by [
Heuristic algorithms capable of real-time decision-making were introduced in [
In this paper, we propose a framework that constructs scheduling policies for JSSPs using a DRL mechanism: the agent learns to choose appropriate actions to achieve its goal through interactions with the system environment, guided by the rewards received for each action. First, we formulate the scheduling of a JSSP as a sequential decision-making problem under the MDP framework, and then we represent the state of a JSSP as a graph whose nodes are operations and whose edges are constraints. Next, we use GIN [
The rest of the paper is organized as follows: Section 2 explains the methods used in this research. Section 3 introduces DRLAC for job-shop scheduling with the proposed architecture. Section 4 presents the experiments in two parts, using a case study and benchmark datasets, followed by the result analysis. Finally, conclusions and future research are presented in Section 5.
The JSSP is a scheduling problem that aims to find the optimal sequential assignment of machines to multiple jobs, each consisting of a sequence of operations, while maintaining the problem constraints (such as processing precedence and machine sharing).
When a request is received for n jobs defined as
Each machine can produce several kinds of products with different efficiencies, defined as
The processing of a job on a machine is called an operation; each operation can be processed on one or more suitable machines with different processing times. The operation can be denoted by
Therefore, the job-shop scheduling problem is regarded as a sequential decision problem of assigning jobs to specific machines
The processing of a job can begin at any time, as soon as the required machine becomes available.
Each job must pass through a series of pre-defined operations, where an operation cannot start until the previous one is complete (i.e., processing
Each operation must be processed completely on one or more machines.
When assigning jobs to machines, it is necessary to consider whether a machine has operations that can be processed at that time.
If there are multiple candidate operations, the machine can select one of them for processing. Otherwise, it must wait for the next operation-completion event.
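This waiting rule can be sketched as a small dispatching helper. The function below is an illustrative assumption, not the paper's implementation; in particular, the shortest-processing-time tie-break is only a placeholder for the choice that the DRL agent later learns:

```python
def select_operation(candidates, key=lambda c: c[1]):
    """Pick an operation for a free machine, or signal a wait.

    candidates: list of (job, processing_time) pairs whose next
    operation is ready on this machine. Returns None when nothing is
    ready, i.e., the machine waits for the next completion event.
    The shortest-processing-time tie-break is a placeholder policy.
    """
    if not candidates:
        return None
    return min(candidates, key=key)

print(select_operation([("J1", 20), ("J2", 10)]))  # ('J2', 10)
print(select_operation([]))                        # None
```

Replacing the `key` function swaps in a different dispatching rule, which is exactly the decision the learned policy takes over.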
The solution approach applied in this paper is graph-based and represents the JSSP as
N is the set of nodes representing the processes and
To illustrate job-scheduling problems, we assume the JSSP problem as
Later in this paper, this representation lets us deal with complex, dynamic, and uncertain environments (such as new jobs arriving) by adding or removing nodes and/or edges from the disjunctive graph.
We represent the state of JSSP using a disjunctive graph like the following:
Nodes represent operations,
Conjunctive edges represent precedence constraints between two nodes,
Disjunctive edges represent machine-sharing constraints between two operations.
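As a concrete sketch, the node and edge sets above can be built with plain Python for the two-job instance used in the case study later in the paper (machine orders 3, 1, 2 and 1, 2, 3 with processing times 20, 20, 7 and 10, 15, 10); the data structures here are illustrative assumptions, not the paper's implementation:

```python
# Illustrative disjunctive-graph construction for the two-job case-study
# instance; (machine, processing_time) per operation.
jobs = {
    "J1": [(3, 20), (1, 20), (2, 7)],
    "J2": [(1, 10), (2, 15), (3, 10)],
}

# Nodes: one per operation, identified as (job, operation index).
nodes = [(j, i) for j, ops in jobs.items() for i in range(len(ops))]

# Conjunctive edges: directed precedence arcs between consecutive
# operations of the same job.
conjunctive = [((j, i), (j, i + 1))
               for j, ops in jobs.items() for i in range(len(ops) - 1)]

# Disjunctive edges: undirected arcs between operations sharing a machine.
by_machine = {}
for j, ops in jobs.items():
    for i, (m, _) in enumerate(ops):
        by_machine.setdefault(m, []).append((j, i))
disjunctive = [(a, b)
               for ops in by_machine.values()
               for k, a in enumerate(ops) for b in ops[k + 1:]]

print(len(nodes), len(conjunctive), len(disjunctive))  # 6 4 3
```

Dispatching an operation then amounts to fixing the direction of its disjunctive arcs, which is how actions modify the graph in the MDP view that follows.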
Since the next job-processing state depends only on the present state, we can model the job-shop scheduling problem as an MDP. We therefore treat dispatching decisions as actions that change the graph, and we formulate the MDP model as follows:
For each node
The problem environment feeds back a corresponding reward for each agent action, so we propose to tie the reward in the job-shop scheduling process to the machine functions, meaning that the obtained makespan is a function of the cumulative rewards. This leads to the optimal policy for our problem and an optimal or near-optimal solution (minimum makespan); we formulate the reward as the minimum of the makespan
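One common way to realize this coupling between cumulative reward and makespan, assumed here purely for illustration, is to reward each action with the negative increase of the makespan estimate, so that an episode's rewards telescope to the negative final makespan:

```python
def step_reward(prev_makespan_estimate, new_makespan_estimate):
    """Reward = -(increase of the makespan estimate after an action).

    Summed over an episode, these rewards telescope to
    initial_estimate - final_makespan, so maximizing the cumulative
    reward minimizes the makespan. This is a sketch; the paper's
    exact reward function may differ.
    """
    return prev_makespan_estimate - new_makespan_estimate

# Illustrative makespan estimates after each dispatch decision.
estimates = [0, 20, 30, 47]
rewards = [step_reward(a, b) for a, b in zip(estimates, estimates[1:])]
print(sum(rewards))  # -47: cumulative reward is minus the final makespan
```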
GIN is one example of a maximally powerful GNN while remaining simple [
The disjunctive graph based on the MDP formulation provides an overview of the scheduling state, containing numerical and structural information such as operation processing times, the processing order on each machine, and precedence constraints. Efficient learning becomes viable when all of this state information is extracted from the disjunctive graph, which prompts us to select the stochastic policy
Given a graph
After updates iterations
The disjunctive
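A single GIN update can be sketched in NumPy as follows. This is a one-layer sketch under simplifying assumptions (an identity-weight, single-layer MLP); the actual model stacks K iterations of multi-layer MLPs over the disjunctive graph's adjacency:

```python
import numpy as np

def gin_layer(H, A, W, eps=0.0):
    """One GIN iteration: h_v <- ReLU(W applied to (1+eps)*h_v + sum of neighbours).

    H: (n, d) node features, A: (n, n) adjacency matrix of the
    disjunctive graph, W: (d, d) weight of a single-layer MLP standing
    in for the multi-layer perceptron used in the full model.
    """
    aggregated = (1.0 + eps) * H + A @ H    # self feature plus neighbour sum
    return np.maximum(aggregated @ W, 0.0)  # ReLU non-linearity

# Tiny example: 3 operations in a precedence chain.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.eye(3)                    # one-hot initial node features
H1 = gin_layer(H, A, np.eye(3))  # identity MLP keeps the sums visible
print(H1)
```

After one update, each node's feature mixes its own one-hot vector with its neighbours', which is how structural information from the disjunctive graph flows into the node embeddings the policy consumes.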
Among reinforcement learning methods, policy-based methods are more suitable than value-based methods for continuous state and action spaces such as those of the JSSP. This is useful in our problem environment because the action space is continuous and dynamic, and policy-based methods converge faster. In reinforcement learning, the agent interacts with its surrounding environment, performs actions, and learns by trial and error.
As one of the policy-based methods, the actor-critic training algorithm can achieve good performance by limiting policy updates, which reduces sensitivity to parameter settings. The actor is the policy network πθ described in the previous section; it controls how the agent behaves by learning the optimal policy (policy-based). The critic vφ shares the GIN network with the actor and uses
To speed up learning, we follow the AC principle and update the network parameters. For the JSSP, DRL can implement real-time scheduling and adapt its strategy to the environment's current state. DRL aims to enable the agent to learn actions that maximize the cumulative reward obtained while interacting with the environment.
We present in
We define the environment as a set of jobs with their assigned machines and processing times. In the JSSP environment, the agent must observe the state information at each moment, such as the processing status of jobs, the assigned-machine matrix, and the processing-time matrix, and then take an action that assigns operations to appropriate machines so that jobs are processed in an orderly manner with the minimum of the maximum completion time. To select an action
This is the agent policy-update step, where the buffer memory forms mini-batches to update
We adopt the Actor-Critic algorithm to train our agent and provide details of our algorithm in terms of pseudo-code, as shown in Algorithm 1; it provides pseudo-code for RL agent interacting with an environment with changing action sets.
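The per-step losses inside such actor-critic pseudo-code can be sketched as plain arithmetic. This is only a sketch of the standard advantage actor-critic losses; the value-loss coefficient of 0.01 follows the settings table, and all gradient bookkeeping is left to the framework:

```python
def actor_critic_losses(log_prob, value, ret, value_coef=0.01):
    """Standard advantage actor-critic losses for one step (a sketch).

    advantage = return - critic value; the actor minimizes
    -log_prob * advantage (policy gradient with a learned baseline),
    and the critic minimizes the squared advantage, scaled by
    value_coef as in the hyperparameter settings.
    """
    advantage = ret - value
    policy_loss = -log_prob * advantage
    value_loss = value_coef * advantage ** 2
    return policy_loss, value_loss

# Example: the critic underestimated the return (advantage = 3),
# so the sampled action's log-probability is pushed up.
p, v = actor_critic_losses(log_prob=-0.5, value=-50.0, ret=-47.0)
print(p, v)  # policy loss 1.5, value loss ~0.09
```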
We used the settings previously used in our research, where they performed well [
Parameter | Value | Parameter | Value
---|---|---|---
Episodes | 1000 | Hidden layers with hidden dimension 32 | 2
Learning rate α | 0.001 | Hidden layers with hidden dimension 64 | 2
Discount factor γ | 1 | Coefficient for policy loss | 1
Number of iterations K | 2 | Coefficient for value function loss | 0.01
Epochs of updating network | 1 | Seed | 1
In this research, we use the Adam optimizer to train the AC algorithm. Other parameters follow the default settings in PyTorch [
The evaluation criterion in the experiment is the maximum completion time ‘makespan’ and our goal is to minimize it.
Unlike optimization strategies, reinforcement learning allows moves that receive a negative reward; therefore, we set the rewards returned by our model to be negative. In general, we prefer negative returns for stability: the returns affect the back-propagated gradients, so we keep their values within an appropriate range, and shifting all rewards (good and bad) equally changes nothing.
We suppose a job instance with two jobs
 | Operations of J1 | | | Operations of J2 | |
---|---|---|---|---|---|---
Machine-Order | 3 | 1 | 2 | 1 | 2 | 3
Processing-Time | 20 | 20 | 7 | 10 | 15 | 10
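This instance can be checked with a small list-scheduling routine. The sketch below simulates one candidate dispatching order (alternating the two jobs); this order is an illustrative assumption, not the paper's learned policy:

```python
def makespan(jobs, order):
    """Makespan of a JSSP schedule under a given dispatching order.

    jobs: {name: [(machine, time), ...]}; order: one job name per
    operation, dispatched in list order (a semi-active schedule).
    """
    machine_free = {}
    job_free = {j: 0 for j in jobs}
    next_op = {j: 0 for j in jobs}
    end = 0
    for j in order:
        m, t = jobs[j][next_op[j]]
        start = max(job_free[j], machine_free.get(m, 0))
        job_free[j] = machine_free[m] = start + t
        next_op[j] += 1
        end = max(end, start + t)
    return end

jobs = {"J1": [(3, 20), (1, 20), (2, 7)], "J2": [(1, 10), (2, 15), (3, 10)]}
print(makespan(jobs, ["J1", "J2", "J1", "J2", "J1", "J2"]))  # 47
```

For this two-job instance, the alternating order already reaches 47, which equals job J1's total processing time (20 + 20 + 7) and is therefore a lower bound.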
When modeling this instance as a disjunctive graph
Each job has three operations, each operation
Now, to illustrate the NP-hardness of job scheduling, we suppose an additional job
When we assume that the operations of two jobs have the same requirement for each machine, a conflict arises among a subset of jobs, and the requirement of at least one machine exceeds its capacity.
We consider the previous instance of three jobs
The first operation
The optimal solution of this instance can attain different completion times based on the Remaining Processing Time (RPT), obtaining C_max = 70, 65, or 60 UT with different job distributions in the two cases shown in
The proposed algorithm can be trained over a number of steps to resolve the conflicting schedules based on the longest RPT; over 100 episodes with six steps per problem environment, the model generates the optimal schedule, and the minimum of the maximum completion time is obtained from the cumulative rewards generated during training, as illustrated in
To demonstrate the performance of the proposed algorithm, we evaluate our model on benchmark instances of various sizes from the OR-Library [
Instance-Size | Optimal | SPT | LPT | FIFO | DRLAC | APD
---|---|---|---|---|---|---|
FT06 6 × 6 | 55 | 88 | 77 | 65 | 65 | 0.181818 |
FT10 10 × 10 | 930 | 1074 | 1295 | 1184 | 1091 | 0.173118 |
LA01 10 × 5 | 666 | 751 | 822 | 772 | 693 | 0.040541 |
LA02 10 × 5 | 655 | 821 | 990 | 830 | 799 | 0.219847 |
LA03 10 × 5 | 597 | 672 | 825 | 755 | 631 | 0.056951 |
LA04 10 × 5 | 590 | 711 | 818 | 695 | 658 | 0.115254 |
LA05 10 × 5 | 593 | 610 | 693 | 610 | 620 | 0.045531 |
LA06 15 × 5 | 926 | 1200 | 1125 | 926 | 957 | 0.033477 |
LA07 15 × 5 | 890 | 1034 | 1069 | 1088 | 960 | 0.078652 |
LA08 15 × 5 | 863 | 942 | 1035 | 980 | 989 | 0.146002 |
LA09 15 × 5 | 951 | 1045 | 1183 | 1018 | 994 | 0.045216 |
LA11 20 × 5 | 1222 | 1473 | 1467 | 1272 | 1233 | 0.009002 |
LA12 20 × 5 | 1039 | 1203 | 1240 | 1039 | 1171 | 0.127045 |
LA13 20 × 5 | 1150 | 1275 | 1230 | 1199 | 1222 | 0.062609 |
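The deviation reported in the last column can be reproduced as the relative gap between the DRLAC makespan and the known optimum (a sketch of the metric as we read it from the table):

```python
def percentage_deviation(c_model, c_opt):
    """Relative deviation of an obtained makespan from the known optimum."""
    return (c_model - c_opt) / c_opt

print(round(percentage_deviation(65, 55), 6))      # FT06: 0.181818
print(round(percentage_deviation(1233, 1222), 6))  # LA11: 0.009002
```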
The results of makespan in
Instance-Size | Optimal | GA | DRL | MARL | DDPG | DRLAC | APD
---|---|---|---|---|---|---|---|
ORB1 10 × 10 | 1059 | 1379 | 1131 | 1154 | 1211 | 1074 | 0.014164306 |
ORB2 10 × 10 | 888 | 1141 | 993 | 931 | 1002 | 978 | 0.101351351 |
ORB3 10 × 10 | 1005 | 1300 | 1092 | 1095 | 1150 | 1070 | 0.064676617 |
ORB4 10 × 10 | 1005 | 1229 | 1118 | 1068 | 1132 | 1081 | 0.075621891 |
ORB5 10 × 10 | 887 | 1135 | 972 | 974 | 1045 | 902 | 0.016910936 |
ORB6 10 × 10 | 1010 | 1309 | 1140 | 1064 | 1106 | 1104 | 0.093069307 |
ORB7 10 × 10 | 397 | 505 | 432 | 424 | 468 | 469 | 0.181360202 |
ORB8 10 × 10 | 899 | 1174 | 979 | 956 | 1022 | 1000 | 0.112347052 |
ORB9 10 × 10 | 934 | 1158 | 1005 | 996 | 1082 | 1050 | 0.124197002 |
From the results in
In addition, DDPG is better than DRL from
The contributions of this paper are summarized as follows: we formulate the JSSP as a sequential decision-making problem; we design a model that represents the scheduling policy based on a Graph Isomorphism Network; and we introduce the actor-critic network training algorithm. Our policy-network design has the advantage that it can potentially deal with dynamic and uncertain environments, such as new jobs arriving or random machine breakdowns, by adding or removing nodes and/or edges from the disjunctive graph.
We noted that the GIN scheduling representation for the JSSP outperforms practically favored dispatching rules on various benchmark JSSP instances and provides an effective scheduling solution for new job instances. Second, since all nodes share all parameters in the graph, this property effectively enables generalization to instances of different sizes without retraining or knowledge transfer.
Finally, observing the simulation results, we find that if the goal of scheduling is to minimize tardiness, makespan, etc., it may be appropriate to assign negative rewards when one or more jobs miss their deadlines. For future work, our model could be extended to other shop-scheduling problems (e.g., flow shop), and we could introduce another type of graph network, such as Graph Pointer Networks (GPNs), with reinforcement learning (RL) to tackle the optimization problem.