
Tunable-bias based optical neural network for reinforcement learning in path planning

Open Access

Abstract

Owing to their high integration, reconfigurability and strong robustness, Mach-Zehnder interferometer (MZI) based optical neural networks (ONNs) have been widely studied. However, few works add bias, which is important for neural networks, to ONNs or systematically study its effect. In this article, we propose a tunable-bias based optical neural network (TBONN) with one unitary matrix layer, which improves the utilization rate of the MZIs, increases the trainable weights of the network, and has a more powerful representational capacity than traditional ONNs. By systematically studying its underlying mechanism and characteristics, we demonstrate that TBONN can achieve higher performance by adding more optical biases to the same side beside the inputted signals. For the two-dimensional dataset, the average prediction accuracy of TBONN with 2 biases (97.1%) is 5% higher than that of TBONN with 0 biases (92.1%). Additionally, utilizing TBONN, we propose a novel optical deep Q network (ODQN) algorithm to complete path planning tasks. In simulated experiments, our ODQN shows competitive performance compared with the conventional deep Q network, but accelerates the computation speed by 2.5 times and 4.5 times for 2D and 3D grid worlds, respectively. A more noticeable acceleration is expected when applying TBONN to more complex tasks. We also demonstrate the strong robustness of TBONN and show that the effect of device imprecision can be eliminated by on-chip training.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

As a critical technology in the deep learning domain [1], artificial neural networks (ANNs) [2] have long been a popular research topic and have been applied in a large number of fields, such as image recognition [3] and speech recognition [4]. However, the current mainstream hardware for ANNs is still based on traditional electronic components, whose speed and energy consumption are limited by the gradually failing Moore's law and by the von Neumann architecture with separated storage and computation. Owing to the large bandwidth, low power consumption and low crosstalk of optical platforms, optical neural networks (ONNs), whose computational speed can be increased by 4-6 orders of magnitude [5], have attracted extensive attention [6-8]. At present, the mainstream ONNs are constructed using silicon-based Mach-Zehnder interferometers (MZIs) [6], wavelength division multiplexing (WDM) [7] and spatial diffraction [8,9]. Among them, MZI-based ONNs, which have the advantages of high integration, reconfigurability and strong robustness, have received wide attention.

So far, most MZI-based optical computing works use MZI arrays to realize linear multiplication or convolution operations in ONNs [10,11]. Few works implement MZI-based ONNs with bias [12-14], even though bias is significant for neural networks. Compared with a network without bias, the decision boundary of a network with a bias term need not pass through the coordinate origin, which results in faster convergence, higher accuracy and better data fitting ability for the model. In addition, matrices with unequal numbers of rows and columns often appear in neural networks, especially complex networks, so realizing such matrices by photonic means is of great significance. Currently, one realization method uses singular value decomposition (SVD) [12,15]. However, to keep the MZI system universal, the two unitary matrices generated by SVD are usually realized by repeatedly mapping them onto the same MZI system. The MZI mesh must therefore match the size of the larger unitary matrix, leaving some MZIs wasted. Another realization method is to leave the redundant input or output ports of the unitary matrix network idle, but this also wastes a lot of hardware resources. Notably, setting a redundant input port to a bias has been employed directly in previous work [16], but an in-depth systematic study of the mechanism behind this bias, and of more general settings such as the case of multiple idle input ports, has not been made. Therefore, studying the underlying mechanism of bias in MZI-based ONNs and effective methods of constructing optical networks with bias, especially for matrices with unequal numbers of rows and columns, is significant.

On the other hand, current research on MZI-based ONNs mainly addresses simple supervised learning tasks, such as handwritten digit recognition and other classification tasks [17,18], while more complex functions for ONNs have rarely been studied. The acceleration effect of ONNs is better highlighted when they deal with more complex tasks, such as unsupervised learning and reinforcement learning tasks. Compared with the main tasks completed by unsupervised learning, such as clustering and dimensionality reduction, the tasks completed by reinforcement learning, including path planning, are more complex. In 2020, Fulvio Flamini et al. [19] combined SARSA, Q-learning and projective simulation methods to train an MZI-based decision tree array model and realized a 3D grid world task, but the resulting system is large in scale and its overall utilization rate is low. Apart from this work, few studies use MZI-based ONN-like schemes to solve reinforcement learning tasks.

In this article, we propose a tunable-bias based optical neural network (TBONN). By improving the utilization rate of the MZIs, the trainable weights of the network are increased, giving the network a more powerful representational capacity. Moreover, we analyze the basic principle of optical bias and use two classification instances to systematically analyze the effect of the number and the location of the optical biases on TBONN. Further, we introduce an optical deep Q network (ODQN) algorithm, combining the original DQN algorithm with the photonic backpropagation (BP) algorithm [16], to train the TBONN and complete path planning tasks in 2D and 3D grid worlds. The noise errors in the TBONN and the ODQN algorithm are also analyzed. Our proposed TBONN combined with ODQN shows excellent potential for deep reinforcement learning tasks, such as autonomous driving [20] and robotics [21].

2. Theory of the proposed TBONN

As shown in Fig. 1(a), our proposed TBONN is composed of a linear process part (UML1, UML2, …, UMLl) and a nonlinear process part (f1, f2, …, fl). The unitary matrix layer (UML), which is the linear process part of TBONN and is shown in Fig. 1(b), consists of a phase-shifter array and an MZI array to implement matrix-vector multiplication. According to the SVD principle [22], any matrix H may be decomposed as:

$${\boldsymbol H} = {\boldsymbol{U\varSigma}}{{\boldsymbol V}^\dagger }$$
where U is a c × c unitary matrix, Σ is a c × d rectangular diagonal matrix and V† is the conjugate transpose of a d × d unitary matrix V. Besides, any U and V can be implemented with MZIs and phase shifters, and Σ can be implemented using optical attenuators or MZIs [23]. Furthermore, according to the decomposition method proposed by Clements et al. [24], any unitary transformation of order d can be realized by a multiport interferometer with d(d−1)/2 MZIs and d phase shifters. Notably, the linear operation of neural networks often leads to high computational complexity, and neural networks can still achieve good results without using a complete matrix [25]. To simplify the devices needed for the ONN, our proposed TBONN constructs a system that realizes only a unitary matrix (U or V) rather than the arbitrary matrix (H). In theory, this may degrade the performance of the ONN because of its much smaller parameter space, but the degradation is not significant. To prove the effectiveness of this method, the proposed TBONN is compared below with ONNs that realize the full matrix H.
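As a quick illustration of this decomposition (a minimal numpy sketch of our own, not the authors' code), the SVD of an arbitrary complex matrix can be checked numerically, together with the MZI count d(d − 1)/2 required for each unitary factor:

```python
import numpy as np

d = 5
# An arbitrary complex matrix H, standing in for a weight matrix.
H = np.random.randn(d, d) + 1j * np.random.randn(d, d)

U, s, Vh = np.linalg.svd(H)                      # H = U @ diag(s) @ V†, i.e. Eq. (1)
assert np.allclose(U @ np.diag(s) @ Vh, H)
assert np.allclose(U.conj().T @ U, np.eye(d))    # U is unitary
assert np.allclose(Vh @ Vh.conj().T, np.eye(d))  # V† is unitary

# Clements decomposition: one unitary of order d needs d(d-1)/2 MZIs and d phase shifters.
n_mzi = d * (d - 1) // 2
print(n_mzi)                                     # -> 10 for d = 5
```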


Fig. 1. (a) The general scheme of our proposed TBONN based on MZIs. (b) As the linear process part in TBONN, unitary matrix layer consists of MZIs array and phase shifters array. Each MZI comprises an internal phase shifter (θ), an external phase shifter (φ) and two 3-dB directional couplers. The ai (i = 1, 2, …, m), bj (j = m + 1, m + 2,…, n), and ok (k = 1, 2, …, n) represent the inputted signals modulated on amplitude, inputted bias, and outputted signals, respectively.


It can easily be seen that the number of input ports and output ports of a UML must be the same, since the unitary matrix is a square matrix. However, the input and output dimensions of the weight operation in many neural networks are unequal [26]. When the number of output ports is smaller than that of input ports, the redundant output ports of the UML can be freely chosen from all output ports and left idle; this is the method most ONN works have employed [14]. On the other hand, when the number of output ports is larger than that of input ports, the redundant ports occur at the input, and a common approach is to set the unused input ports to 0. Assuming the last few ports are set to 0, the output signal O processed by a unitary matrix can be given by:

$${\boldsymbol O} = {\boldsymbol{WA}} = \left( {\begin{array}{cccccc} {{w_{\textrm{11}}}}&{{w_{\textrm{12}}}}& \cdots & \cdots & \cdots &{{w_{\textrm{1}n}}}\\ {{w_{\textrm{21}}}}&{{w_{\textrm{22}}}}& \cdots & \cdots & \cdots &{{w_{\textrm{2}n}}}\\ \vdots & \vdots & \ddots &{}&{}& \vdots \\ \vdots & \vdots &{}& \ddots &{}& \vdots \\ \vdots & \vdots &{}&{}& \ddots & \vdots \\ {{w_{n\textrm{1}}}}&{{w_{n\textrm{2}}}}& \cdots & \cdots & \cdots &{{w_{nn}}} \end{array}} \right)\left( {\begin{array}{c} {{a_\textrm{1}}}\\ \vdots \\ {{a_m}}\\ \textrm{0}\\ \vdots \\ \textrm{0} \end{array}} \right)$$
where m < n, ai (i = 1, 2, …, m) is the inputted signal, and wjk (1 ≤ j ≤ n, 1 ≤ k ≤ n), which is a complex number, is an element of the weight matrix W. The elements in the last n − m columns of W have no effect on the output when only m input ports are used, leading to a significant waste of hardware resources, especially for large-scale MZI arrays.

Here, the input powers of the idle input ports (bm + 1, bm + 2, …, bn) in our proposed TBONN, as shown in Fig. 1(b), are set to 1, so that the previously useless elements of the weight matrix and the idle input ports jointly realize the bias function. If we place the bias ports after the inputted signals, this process can be described by the following equations:

$$\begin{aligned} {\boldsymbol O} &= {\boldsymbol W}{{\boldsymbol A}_\textrm{1}}\\ &= \left( {\begin{array}{cccccc} {{w_{\textrm{11}}}}&{{w_{\textrm{12}}}}& \cdots & \cdots & \cdots &{{w_{\textrm{1}n}}}\\ {{w_{\textrm{21}}}}&{{w_{\textrm{22}}}}& \cdots & \cdots & \cdots &{{w_{\textrm{2}n}}}\\ \vdots & \vdots & \ddots &{}&{}& \vdots \\ \vdots & \vdots &{}& \ddots &{}& \vdots \\ \vdots & \vdots &{}&{}& \ddots & \vdots \\ {{w_{n\textrm{1}}}}&{{w_{n\textrm{2}}}}& \cdots & \cdots & \cdots &{{w_{nn}}} \end{array}} \right)\left( {\begin{array}{c} {{a_\textrm{1}}}\\ \vdots \\ {{a_m}}\\ \textrm{1}\\ \vdots \\ \textrm{1} \end{array}} \right)\\ &=\left( {\begin{array}{ccc} {{w_{\textrm{11}}}}& \cdots &{{w_{\textrm{1}m}}}\\ \vdots & \ddots & \vdots \\ {{w_{n\textrm{1}}}}& \cdots &{{w_{nm}}} \end{array}} \right)\left( {\begin{array}{c} {{a_\textrm{1}}}\\ \vdots \\ {{a_m}} \end{array}} \right) + \left( {\begin{array}{ccc} {{w_{\textrm{1(}m\textrm{ + 1)}}}}& \cdots &{{w_{\textrm{1}n}}}\\ \vdots & \ddots & \vdots \\ {{w_{n(m + \textrm{1})}}}& \cdots &{{w_{nn}}} \end{array}} \right)\left( {\begin{array}{c} \textrm{1}\\ \vdots \\ \textrm{1} \end{array}} \right) \end{aligned}$$
$$\textrm{set }\quad\quad\quad{\left( {\begin{array}{ccc} {{w_{\textrm{11}}}}& \cdots &{{w_{\textrm{1}m}}}\\ \vdots & \ddots & \vdots \\ {{w_{n\textrm{1}}}}& \cdots &{{w_{nm}}} \end{array}} \right) = {{\boldsymbol W}_\textrm{1}},}\quad{\left( {\begin{array}{c} {{a_\textrm{1}}}\\ \vdots \\ {{a_m}} \end{array}} \right) = {{\boldsymbol A}_\textrm{2}},}\quad{\left( {\begin{array}{ccc} {{w_{\textrm{1(}m\textrm{ + 1)}}}}& \cdots &{{w_{\textrm{1n}}}}\\ \vdots & \ddots & \vdots \\ {{w_{n(m + \textrm{1})}}}& \cdots &{{w_{nn}}} \end{array}} \right)\left( {\begin{array}{c} \textrm{1}\\ \vdots \\ \textrm{1} \end{array}} \right) = }{\boldsymbol B}$$
$$\textrm{then }\quad\quad\quad\quad\quad\quad{\boldsymbol O}\textrm{ = }{{\boldsymbol W}_\textrm{1}}{{\boldsymbol A}_\textrm{2}}\textrm{ + }{\boldsymbol B}$$

It can easily be seen that Eq. (4) is similar to the forward-propagation process of a biased neural network [27]. Thus, we refer to B as the optical bias. Further, the bias can be trained by adjusting the phase shifters. This method improves the performance of ONNs and increases the utilization of existing hardware. Besides, the nonlinear process part of TBONN (f1, f2, …, fl) can be realized by electro-optic nonlinear activation [28] or a photodetector array [29,30], which detects the optical power to realize a square nonlinearity.
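To make the equivalence derived above concrete, the following numpy sketch (our own illustration, not the paper's code) checks that driving the n − m idle ports of a unitary weight matrix with unit power is the same as applying a reduced matrix W1 to the signal and adding a bias vector B built from the remaining columns:

```python
import numpy as np

n, m = 6, 4                                   # 6-port unitary mesh, 4 signal ports
Z = np.random.randn(n, n) + 1j * np.random.randn(n, n)
W, _ = np.linalg.qr(Z)                        # a random n x n unitary weight matrix
a = np.random.randn(m) + 1j * np.random.randn(m)

A1 = np.concatenate([a, np.ones(n - m)])      # idle ports set to 1 (optical bias)
O_full = W @ A1                               # full-mesh output O = W A1

W1 = W[:, :m]                                 # columns that act on the signal
B = W[:, m:] @ np.ones(n - m)                 # bias vector from the "unused" columns
O_split = W1 @ a + B                          # O = W1 A2 + B

assert np.allclose(O_full, O_split)
```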

To further illustrate the principle and effect of the optical bias in TBONN, we complete two typical simple classification tasks: classification of a two-dimensional dataset that splits a rectangle into two triangular areas, and modulation format recognition. First, we use Neuroptica [31], written in Python, to generate the two-dimensional dataset, which can be separated roughly along a diagonal line. As shown in Fig. 2(a), the red cross marks and blue circle marks represent the data in the two categories, respectively. Then we shift all the data in the direction perpendicular to the diagonal line by a distance T, as shown in the inset of Fig. 2(b). To classify these data, we construct a neural network with an architecture of 3-5-5-2. Accordingly, the TBONN needs three UMLs with five ports (l = 3, n = 5), and optical bias is added only at the input ports. Notably, three input ports are needed to input the position of each data point (X, Y) and the normalization term Z = (P − X² − Y²)1/2 (P is set to 2). The TBONN is trained using the adjoint variable method (AVM) [16] and adaptive moment estimation (Adam). To study the effect of the optical bias on the performance of TBONN, the TBONN with zero bias ports (blue solid line), one bias port (orange solid line) and two bias ports (green solid line) in the case of T = 0.1 is considered, as shown in Fig. 2(b). The average prediction accuracy of TBONN with 2 biases over the last 30 generations reaches 97.5%, while that with 1 bias and 0 biases only reaches 96.7% and 88.3%, respectively. This suggests that adding optical bias can greatly improve the performance of the TBONN, which can mainly be attributed to the fact that the optical biases free the decision boundary from having to pass through the coordinate origin. Besides, the average prediction accuracy of TBONN with 2 biases (97.5%) is slightly higher than that with 1 bias (96.7%), which can be attributed to the additional adjustable parameters introduced by the extra optical bias. Furthermore, we find that the average prediction accuracy of TBONN with 2 biases in the cases of T = 0.2 (red dotted line) and T = 0.3 (purple dotted line) stays at 96.8% and 98.1%, respectively. This demonstrates that the TBONN can maintain high performance even when the decision boundary is far away from the diagonal line.
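The 3-port input encoding described above can be written as a short helper function (a sketch of our own, not Neuroptica code): the 2D point (X, Y) is padded with Z = (P − X² − Y²)^1/2 so that every input vector carries the same total optical power P = 2.

```python
import numpy as np

def encode_point(x, y, p=2.0):
    """Amplitude vector for the 3 input ports; requires x**2 + y**2 <= p."""
    z = np.sqrt(p - x**2 - y**2)
    return np.array([x, y, z])

a = encode_point(0.6, -0.3)
print(a, np.sum(a**2))   # the total power of every encoded sample equals P = 2
```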


Fig. 2. (a) A two-dimensional dataset splits the rectangle along a diagonal line. (b) The classification accuracies of the TBONN with zero bias (bm + 1 = 0, bm + 2 = 0 and m = 3), one bias (bm + 1 = 1 and bm + 2 = 0) and two bias ports (bm + 1 = 1 and bm + 2 = 1) when T = 0.1, and the classification accuracies of the TBONN with two bias ports when T = 0.2 and 0.3, respectively. (c) The classification accuracies of the TBONN with zero bias (bm + 1 = 0 and bm + 2 = 0), one bias (bm + 1 = 1 and bm + 2 = 0), two bias ports (bm + 1 = 1 and bm + 2 = 1) for the two-dimensional dataset when the decision boundary is the diagonal line. (d) The classification accuracies of the TBONN with zero bias (m = 4 and n = 4), one bias (m = 4 and n = 5), and two bias ports (m = 4 and n = 6) for the modulation format recognition.


Moreover, we use the identical network to classify the two-dimensional dataset when the decision boundary is the diagonal line, and the calculated results are presented in Fig. 2(c). The average prediction accuracy of TBONN with 0, 1 and 2 biases is 92.1%, 95.7% and 97.1%, respectively. This shows that, by introducing more adjustable parameters, adding bias also significantly improves the performance of the TBONN for the two-dimensional dataset whose decision boundary is the diagonal line. Next, we study the case where the numbers of input and output ports are the same, such as modulation format recognition. This dataset has four effective attributes and four categories, so the numbers of input and output ports of the TBONN must be increased if optical bias is added. For example, the network architecture of TBONN includes two UMLs with 6 ports (l = 2, n = 6), and the calculated results are presented in Fig. 2(d). Again, the more optical biases are introduced, the higher the average prediction accuracy the TBONN can achieve.

Then, we consider the effect of the position of the optical biases on the performance of TBONN. For the two-dimensional dataset that splits the rectangle into two triangular areas, as shown in Fig. 2(a), the TBONN with biases located at the fourth and fifth ports (b4 and b5, purple solid line), the first and fifth ports (b1 and b5, blue solid line), the second and third ports (b2 and b3, orange solid line), and the second and fourth ports (b2 and b4, green solid line) are considered, as shown in Fig. 3(a). The average prediction accuracies of TBONN with biases at (b1, b5), (b2, b3) and (b2, b4) over the last 30 generations only reach 94.98%, 96.07% and 95.77%, all below the 97.1% of TBONN with biases at (b4, b5), shown by the purple line. This suggests that adding all optical biases to the same side beside the inputted signals improves the performance of the TBONN even further. Moreover, as seen in Fig. 3(b), a higher average prediction accuracy of the TBONN can also be achieved by adding all optical biases to the same side of the inputted signals (b5 and b6, purple solid line) for modulation format recognition.

To explain this characteristic, the mean value and variance of the real part of the weights for the inputted signals (Re(WI), red pillar), the real part of the weights for the inputted bias (Re(WB), blue pillar), the imaginary part of the weights for the inputted signals (Im(WI), green pillar) and the imaginary part of the weights for the inputted bias (Im(WB), purple pillar) in the last epoch for the two-dimensional dataset are calculated, as shown in Fig. 3(c). The difference (0.098) between the mean values of Re(WI) and Re(WB) with biases at (b4, b5) is larger than that with biases at (b1, b5), (b2, b3) and (b2, b4), and the difference (0.051) between the mean values of Im(WI) and Im(WB) with biases at (b4, b5) is also larger. This suggests that the dependence between WI and WB is weakened when all optical biases are added to the same side beside the inputted signals, which helps the TBONN achieve better performance. Moreover, as seen in Fig. 3(d), the dependence between WI and WB in the TBONN with biases at (b5, b6) is also weaker than in the other cases for modulation format recognition. A similar phenomenon is found in the mean values of the MZI phases that control the inputted signals and the bias signals, respectively, as shown in Fig. 3(e); here, we do not consider the MZIs that influence all inputted and bias signals. This further suggests that the phase distributions of the MZIs for the inputted and bias signals are quite different. Thus, by adding all optical biases to the same side beside the inputted signals, WI and WB can better fit the data, leading to a better performance for TBONN.

Notably, the TBONN including two UMLs with 6 ports (l = 2, n = 6) used in the modulation format recognition is also simulated using the Lumerical INTERCONNECT software. As seen in Fig. 3(f), the output values are essentially the same when we input six identical signals and set the same weights, which verifies the reliability of the Neuroptica simulator we use. Moreover, unlike in an electronic ANN, the weight applied to the bias cannot be fixed. However, whether the weights are fixed or not has little effect on the network, because the bias is usually trained to complete the task. Therefore, our proposed TBONN remains suitable for most situations.
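The weight statistics reported in Fig. 3(c)-(d) can be gathered along the following lines (a hedged sketch of our own; how the complex weight matrix W is extracted from the trained mesh depends on the simulator): the matrix is split column-wise into the block WI that multiplies the inputted signals and the block WB that multiplies the bias ports, and the mean and variance of the real and imaginary parts of each block are reported.

```python
import numpy as np

def weight_statistics(W, signal_ports):
    """W: n x n complex weight matrix; signal_ports: column indices of the inputted signals."""
    bias_ports = [k for k in range(W.shape[1]) if k not in signal_ports]
    WI, WB = W[:, signal_ports], W[:, bias_ports]
    return {name: (block.mean(), block.var())
            for name, block in (("Re(WI)", WI.real), ("Re(WB)", WB.real),
                                ("Im(WI)", WI.imag), ("Im(WB)", WB.imag))}

# Example: biases at (b4, b5) of a 5-port UML, i.e. signal columns 0-2.
# stats = weight_statistics(W_trained, signal_ports=[0, 1, 2])
```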


Fig. 3. (a) The classification accuracies of the TBONN with biases at (b4, b5), (b1, b5), (b2, b3) and (b2, b4) for the two-dimensional dataset when the decision boundary is the diagonal line. (b) The classification accuracies of the TBONN with biases at (b5, b6), (b1, b6), (b3, b4) and (b2, b5) for the modulation format recognition. (c) The mean value and variance (numbers in parentheses) of Re(WI), Re(WB), Im(WI) and Im(WB) in the TBONN with biases at (b4, b5), (b1, b5), (b2, b3) and (b2, b4) in the last epoch for the two-dimensional dataset. (d) The mean value and variance of Re(WI), Re(WB), Im(WI) and Im(WB) in the TBONN with biases at (b5, b6), (b1, b6), (b3, b4) and (b2, b5) in the last epoch for the modulation format recognition. (e) The mean value and variance of MZIs’ phases, which control inputted signals (phase(WI)) and bias signals (phase(WB)), respectively, for the two-dimensional dataset and the modulation format recognition. The inset shows the MZIs controlling WI (red dotted box) and WB (blue dotted box) in the TBONN with biases at (b4, b5). (f) The output result using Neuroptica and Lumerical interconnect software, respectively.


3. Method of the proposed TBONNs for reinforcement learning

Current research on MZI-based ONNs mainly addresses simple supervised learning tasks, such as handwritten digit recognition and other classification tasks, but the acceleration effect of ONNs is better highlighted when they deal with more complex tasks, such as those employing reinforcement learning. Thus, we propose an optical deep Q network (ODQN) algorithm using the TBONN. In the ODQN algorithm, we combine the DQN algorithm [32], a typical deep reinforcement learning algorithm, with the Adam algorithm and AVM to train the TBONN by optimizing the phase shifters in the optical mesh. The flowchart of the learning process for the agent based on the ODQN algorithm is shown in Fig. 4. Moreover, path planning is a critical technology in many emerging industries, such as robot visual navigation and autonomous driving [33,34]. To further illustrate the effect of the proposed TBONN and ODQN algorithm, in this part we use the ODQN algorithm to train the proposed network to complete path planning tasks in 2D and 3D grid worlds.


Fig. 4. The flowchart of the learning process for the agent based on ODQN.


The details of the ODQN algorithm are outlined as follows (a simplified code sketch is given after the list):

  • 1. Initialize the network architecture, parameters and environment state. We construct two TBONNs, the Q evaluate network and the Q target network, which are set with the same network architecture and the same initial weights. The parameters mainly include the memory capacity N, learning rate α, reward decay coefficient γ, the interval β at which the Q target network copies the updated weights of the Q evaluate network, the decay coefficient ɛ of the environment-exploration probability, and the network weights w. The network weights w are updated during the learning process of the agent, while the remaining parameters are set manually before training. Notably, w of the Q evaluate network is updated every iteration, while w of the Q target network is updated only every few iterations, which makes the algorithm more stable. Besides, the environment state should also be initialized, meaning that the agent starts at the starting point of the path planning problem in the 2D or 3D grid world.
  • 2. Explore the environment. In the early stage of learning, the experience stored in the memory is poor, so the agent needs to keep exploring the environment to accumulate experience, which includes the current state, action, reward value and next state. Notably, neither the weights of the Q evaluate network nor those of the Q target network are updated in this stage. Usually, this environment-exploring process needs only a few iterations; here we set it to 200. In each iteration, the agent first obtains the current state from the environment, and then the decision maker generates an action. At every step of the algorithm, there are two ways to generate an action: randomly generating the action, or generating a definite action according to the Q values predicted by the Q evaluate network. The first method is selected with probability p, which decays from 1 to 0 with decay rate ɛ, and the second method is selected with probability 1 − p. Finally, the environment state is updated according to the generated action, and the experience (last state, action, reward and updated state) is stored in memory.
  • 3. Learn while continuing to explore the environment. After a few experiences are obtained in step 2, the Q evaluate network is trained on a batch of experiences randomly selected from the existing memory, with labels generated by the Q target network. Then we take an action, update the environment and store this experience. We repeat this training of the Q evaluate network, decision making, updating and storing β times. Finally, we update the Q target network by setting its weights equal to those of the Q evaluate network.
  • 4. Repeat step 3 until the path planning problem has run for 500 episodes.
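Below is a simplified, framework-agnostic sketch of steps 1-4 (our reading of the loop, not the authors' code). `TBONN` and `env` are hypothetical placeholders for the photonic Q networks trained via AVM + Adam and for the grid-world environment; for brevity the loop is written over steps rather than episodes.

```python
import random
from collections import deque
import numpy as np

# Hyper-parameters as used for the 2D grid world below; batch size and step count are our choices.
N, alpha, gamma, beta, eps_decay = 2000, 0.01, 0.9, 300, 0.99
batch_size, total_steps, warmup_steps, n_actions = 32, 20000, 200, 4

q_eval, q_target = TBONN(), TBONN()                 # step 1: two identical networks
q_target.set_weights(q_eval.get_weights())

memory = deque(maxlen=N)
p = 1.0                                             # probability of taking a random action
state = env.reset()

for step in range(total_steps):
    # Decision maker (steps 2-3): epsilon-greedy over the Q evaluate network.
    if random.random() < p:
        action = random.randrange(n_actions)
    else:
        action = int(np.argmax(q_eval.forward(state)))
    p *= eps_decay

    next_state, reward, done = env.step(action)
    memory.append((state, action, reward, next_state))
    state = env.reset() if done else next_state

    # Learning phase (step 3): labels come from the Q target network.
    if step >= warmup_steps and len(memory) >= batch_size:
        for s, a, r, s2 in random.sample(memory, batch_size):
            target = q_eval.forward(s).copy()
            target[a] = r + gamma * np.max(q_target.forward(s2))
            q_eval.train_step(s, target, lr=alpha)  # AVM gradient + Adam update of the phases
        if step % beta == 0:                        # copy weights every beta steps
            q_target.set_weights(q_eval.get_weights())
```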

3.1 2D grid world

First, as shown in Fig. 5(a), we build a 10 × 10 grid world where the agent starts at position PA = (0, 0) and terminates at position PR = (3, 8). The agent can take up, down, left and right actions to explore the environment, and the reward value r it receives from the environment is set as:

$$r = \begin{cases} 1 & \textrm{arrive at the target}\\ -1 & \textrm{encounter an obstacle or a boundary}\\ 0.001 & \textrm{otherwise} \end{cases}$$

Each episode terminates when the agent arrives at the target or its path length reaches 100; if the agent encounters a boundary or an obstacle, it remains in its original state and takes the next action. To illustrate the effect of the proposed TBONN and ODQN algorithm, we construct a TBONN consisting of one UML and one photodetector array (l = 1), i.e., the network has no hidden layer. Since the agent can move in four directions in the 2D grid world, the UML needs four input ports (n = 4). The TBONN with zero biases (b3 = 0, b4 = 0), one bias (b3 = 1, b4 = 0) and two bias ports (b3 = 1, b4 = 1) are optimized by the proposed ODQN algorithm, and the calculated results are presented in Fig. 5(b). Here, N = 2000, α = 0.01, γ = 0.9, β = 300 and ɛ = 0.99, respectively. As shown in Fig. 5(b), the step of the TBONN with two bias ports (blue solid line) basically reaches the shortest path length (11 in this case) after 60 episodes, which indicates that the agent can reach the destination. However, the blue solid line shows obvious oscillation. To evaluate the performance of the TBONN more comprehensively, avg100steps, the average step length over the last 100 episodes, is employed. The minimum of avg100steps of the TBONN with two bias ports (purple solid line) reaches 11.4, which proves the effectiveness of our proposed ODQN algorithm based on the TBONN in a 2D grid world. In contrast, the minimum of avg100steps of the TBONN with one bias port (orange solid line) and zero biases (green solid line) is 12.71 and 100, respectively. This suggests that adding optical bias greatly improves the performance of the TBONN and gives it a more powerful representational capacity. Besides, the minimum of avg100steps of TBONN with 2 biases (11.4) is slightly smaller than that with 1 bias (12.71), which can be attributed to the additional adjustable parameters brought by the extra optical bias.
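For reference, a minimal environment sketch (ours, not the paper's code) that reproduces the reward rule above and the stay-in-place behaviour on collisions for the 10 × 10 world:

```python
import numpy as np

class GridWorld2D:
    MOVES = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}   # up, down, left, right

    def __init__(self, size=10, start=(0, 0), goal=(3, 8), obstacles=()):
        self.size, self.start, self.goal = size, start, goal
        self.obstacles = set(obstacles)
        self.pos = start

    def reset(self):
        self.pos = self.start
        return np.array(self.pos, dtype=float)

    def step(self, action):
        dx, dy = self.MOVES[action]
        nxt = (self.pos[0] + dx, self.pos[1] + dy)
        outside = not (0 <= nxt[0] < self.size and 0 <= nxt[1] < self.size)
        if nxt == self.goal:
            self.pos, reward, done = nxt, 1.0, True          # arrive at the target
        elif outside or nxt in self.obstacles:
            reward, done = -1.0, False                       # agent stays in its original state
        else:
            self.pos, reward, done = nxt, 0.001, False       # ordinary move
        return np.array(self.pos, dtype=float), reward, done
```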


Fig. 5. (a) The 10 × 10 grid world environment. The red, black, blue and green rectangles represent the current position of an agent, obstacles, starting position and destination, respectively. (b) The step and avg100steps of the TBONN with zero bias (b3 = 0, b4 = 0 and n = 4), one bias (b3 = 1 and b4 = 0) and two bias ports (b3 = 1 and b4 = 1) in the 10 × 10 grid world. (c) The step and avg100steps of the ANNs in the 10 × 10 grid world. Here, the inset histogram shows the computing time of the environment, the ANN and the decision maker. (d) The step and avg100steps of the TBONN using two UMLs and a diagonal matrix layer which can implement the arbitrary matrix (H).


To further illustrate the effect of the proposed ODQN algorithm, we build an ANN with the same architecture as the above TBONN (4-4) and train it with the original DQN algorithm. However, the agent cannot reach the destination, and the DQN algorithm performs worse than our proposed ODQN. This is because every element of the weight matrix W of the ODQN using TBONN is a complex number, so the number of trainable parameters of the ODQN is actually twice that of the DQN using an ANN. We therefore double the trainable parameters of the DQN by expanding the ANN architecture to 4-4-4. It can be clearly seen from Fig. 5(c) that the step of the ANN (blue solid line) reaches the shortest path length after 200 episodes, and the minimum of avg100steps (purple solid line) reaches 11.73, which is close to the minimum of avg100steps of the ODQN with 2 biases (11.4). This suggests that the ODQN algorithm performs better than the DQN algorithm with the same architecture, and similarly to the DQN algorithm with twice as many trainable parameters, demonstrating the powerful representational capacity of the TBONN. Moreover, we test the computing time of the environment, the ANN and the decision maker on a 2.3-GHz Intel Core i5-8300H processor when the number of inferences is 10,000, and the result is shown in the inset histogram. The computing time of the ANN accounts for 59.2% of the total time, whereas the computing time of the TBONN is almost negligible. Hence, compared with ANNs, the inference speed when using our proposed TBONN to complete the 2D grid world task can be increased by 2.5 times.
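For context, the quoted 2.5× figure follows directly from the measured time fraction if one assumes the optical inference time itself is negligible (a back-of-the-envelope estimate, not an additional measurement):

$$\textrm{speedup} \approx \frac{T_{\textrm{total}}}{T_{\textrm{total}} - T_{\textrm{ANN}}} = \frac{1}{1 - 0.592} \approx 2.5$$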

In particular, the linear operation of neural networks often leads to high computational complexity [25]. To illustrate the effectiveness of the proposed simple TBONN with one UML, we construct a 4 × 4-port TBONN composed of two UMLs and a diagonal matrix layer, which can implement the arbitrary matrix H. The TBONN with two bias ports is trained by the ODQN algorithm on the 10 × 10 maze, and the calculated results are presented in Fig. 5(d). The step (blue solid line) reaches the shortest path length after 170 episodes and the minimum of avg100steps (purple solid line) reaches 11.7. This suggests that the TBONN with one UML can achieve results similar to those of the TBONN with a complete layer for 2D grid world tasks, because the parameter count of the TBONN with one UML and two bias ports is already sufficient for this task, and adding more parameters does not significantly improve performance. In other words, substituting the full matrix H with only one UML is an efficient way to reduce the system area and the number of required trainable parameters (the numbers of internal and external phase shifters) by more than half while maintaining the performance. Notably, for more complex tasks that require a more complex network, our proposed TBONN with only one unitary square matrix may degrade the performance of the ONN more noticeably because of its limited parameter count. This will be studied further in future work.

Moreover, one prominent problem in TBONN is the performance degradation caused by the inaccurate parameters of current state-of-the-art photonic devices. There are three main types of device imprecision: phase shift error, coupling coefficient error and photodetection noise [35]. The phase error of a phase shifter δP can be modeled as a Gaussian-distributed random variable GP with expectation μ = 0 and standard deviation σ = σP, added directly to the original phases θ and φ. The coupling coefficient error δC can be calculated from the measured extinction ratio E of the MZI as δC = 10^(−E/20). The photodetection noise δD follows a Gaussian distribution GD (μ = 0, σ = σD), and the practical received output of the photodetector Ó is expressed as Ó = (1 + δD)O, where O is the ideal output. We test the effects of these three errors on the performance of the TBONN consisting of one UML and photodetectors for the 2D maze, and the results are presented in Fig. 6. The path length error clearly increases with increasing phase shift error σP and photodetection noise σD, and with decreasing extinction ratio E of the MZI. Specifically, the standard deviation of a typical phase shift error is about σP = 0.05 rad, the typical extinction ratio E can reach 20 dB, and the standard deviation of typical photodetection noise is about σD = 0.05, as marked by the black boxes in Fig. 6(a) and 6(b). These values increase the difference between the minimum of avg100steps and the ideal path length to 4.1 and 3.7, respectively. Although these three errors have certain impacts on the performance of the TBONN for the 2D maze, the minimum of avg100steps can still reach 11.5 by fine-tuning the hyper-parameters of the ODQN algorithm and retraining the TBONN, as shown in Fig. 6(c). In other words, the effect of noise on the performance of the network can be eliminated by training, showing the good robustness of our proposed TBONN and ODQN algorithm.
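The three imprecision models above can be injected into a simulated mesh roughly as follows (a hedged sketch of our own, not the training code):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_phases(theta, phi, sigma_p=0.05):
    """Add phase-shift error ~ N(0, sigma_p^2) (rad) to the internal/external phases."""
    return (theta + rng.normal(0.0, sigma_p, np.shape(theta)),
            phi + rng.normal(0.0, sigma_p, np.shape(phi)))

def coupling_error(extinction_ratio_db=20.0):
    """Coupling-coefficient error delta_C = 10^(-E/20) from the extinction ratio E (dB)."""
    return 10 ** (-extinction_ratio_db / 20.0)

def noisy_detection(O, sigma_d=0.05):
    """Practical photodetector output O' = (1 + delta_D) * O with delta_D ~ N(0, sigma_d^2)."""
    return (1.0 + rng.normal(0.0, sigma_d, np.shape(O))) * O
```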


Fig. 6. When changing (a) phase shift error σP and extinction ratio E and (b) phase shift error σP and photodetection noise σD, the difference between the minimum of avg100step with the imprecise ONN chip and the ideal path length based on the 2D maze. (c) The step and avg100steps of the TBONN after fine-tuning the hyper-parameters of the ODQN algorithm and retraining the TBONN when phase shift error, photodetection noise and the extinction ratio are typical values.


3.2 3D grid world

To further demonstrate the effectiveness of the ODQN algorithm and the TBONN for path planning tasks, we build a more complex 10 × 10 × 10 grid world, as shown in Fig. 7(a). The agent can take up, down, left, right, front and back actions to explore the environment. Each episode terminates when the agent arrives at the target or its path length reaches 1000. First, we construct a TBONN consisting of one UML and one photodetector array (l = 1). Here, the UML needs six input ports (n = 6). The TBONN with zero biases (b4 = 0, b5 = 0, b6 = 0), one bias (b4 = 1, b5 = 0, b6 = 0), two bias ports (b4 = 1, b5 = 1, b6 = 0) and three bias ports (b4 = 1, b5 = 1, b6 = 1) are optimized by the proposed ODQN algorithm, and the calculated results are presented in Fig. 7(b). Here, N = 3000, α = 0.001, γ = 0.9, β = 500 and ɛ = 0.98, respectively. As shown in Fig. 7(b), the step of the TBONN with three bias ports (blue solid line) reaches the shortest path length (19 in this case) after 100 episodes and the minimum of avg100steps (purple solid line) reaches 19.24, which proves the effectiveness of the ODQN algorithm based on the TBONN in the 3D grid world. In contrast, the minima of avg100steps of the TBONN with two bias ports (green solid line), one bias port (orange solid line) and zero biases (red solid line) are 19.26, 21.1 and 21.6, respectively. This suggests that the more optical biases are introduced, the higher the performance the TBONN can achieve. To further illustrate the effect of the proposed ODQN algorithm, we build an ANN with an architecture of 6-6-6. It can be clearly seen from Fig. 7(c) that the step of the ANN (blue solid line) reaches the shortest path length after 100 episodes and the minimum of avg100steps (purple solid line) reaches 19.62, which is close to the minimum of avg100steps of the ODQN with 3 biases (19.24). Moreover, we test the computing time of the environment, the neural network and the decision maker when the number of inferences is 10,000, as shown in the inset histogram. The computing time of the electronic neural network accounts for 77.8% of the total time. In other words, compared with ANNs, the speed of our proposed TBONN in completing the 3D grid world task can be increased by 4.5 times (1/(1 − 0.778) ≈ 4.5), which is larger than in the 2D grid world task (2.5 times). It can further be inferred that the acceleration will be more noticeable when TBONN is applied to more complex tasks.
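As a brief illustration (our own assumption of the setup, not the paper's code), the 2D environment sketch given earlier extends to the 10 × 10 × 10 world simply by enlarging the action table to six moves and the episode limit to 1000 steps:

```python
# Six actions for the 3D grid world: up, down, left, right, front, back.
MOVES_3D = {
    0: (0, 0, 1), 1: (0, 0, -1),
    2: (-1, 0, 0), 3: (1, 0, 0),
    4: (0, 1, 0), 5: (0, -1, 0),
}
MAX_PATH_LENGTH_3D = 1000   # each episode ends at the target or after 1000 steps
```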


Fig. 7. (a) The 10 × 10 × 10 grid world environment. The red, blue and green spheres represent the current position of an agent, the starting position of the agent at PA = (3, 1, 4) and the destination at position PR = (9, 9, 9), respectively. The cuboids represent obstacles. (b) The step and avg100steps of the TBONN with zero bias (bn-2 = 0, bn-1 = 0, bn = 0 and n = 6), one bias (bn-2 = 1, bn-1 = 0 and bn = 0), two bias ports (bn-2 = 1, bn-1 = 1 and bn = 0) and three bias ports (bn-2 = 1, bn-1 = 1 and bn = 1) in the 10 × 10 × 10 grid world. (c) The step and avg100steps of the ANNs in the 10 × 10 × 10 grid world. Here, the histogram of the inset shows the computing time of the environment, the ANN and the decision maker. (d) The step and avg100steps of the TBONN which can implement the arbitrary matrix (H). (e) Effects of phase shift error σP and extinction ratio E. (f) Impacts of phase shift error σP and photodetection noise σD.


Then, we build a 6 × 6-port TBONN that can implement the arbitrary matrix H. The TBONN with three bias ports is trained by the ODQN algorithm on the 10 × 10 × 10 maze, and the calculated results are presented in Fig. 7(d). The step (blue solid line) reaches the shortest path length after 300 episodes and the minimum of avg100steps (purple solid line) reaches 19. This suggests that the TBONN with one UML can achieve results similar to those of the TBONN with a complete layer for 3D grid world tasks.

Last, we test the effects of the phase shift error, the coupling coefficient error and the photodetection noise on the performance of the six-port TBONN consisting of one UML and photodetectors in the 3D maze. The imprecisions of the ONN chip have a more significant effect on the performance in the 3D maze than in the 2D maze. With the typical phase shift error, extinction ratio and photodetection noise, as marked by the black boxes in Fig. 7(e) and 7(f), the differences between the minimum of avg100steps and the ideal path length are 662.4 and 18.9, respectively. This indicates that, in practical experiments, these three errors would have greater impacts on the performance of the TBONN in the 3D maze. However, this impact can also be eliminated by on-chip training.

4. Conclusion

In conclusion, by studying the underlying mechanism of MZI-based ONNs, we propose a TBONN with only one unitary matrix layer, which improves the utilization rate of the MZIs, increases the trainable weights of the network and provides a more powerful representational capacity, and we analyze the basic principle of optical bias. Moreover, we use two classification instances to show that TBONN can achieve higher performance by adding more biases and by adding all optical biases to the same side beside the inputted signals. For the two-dimensional dataset, the average prediction accuracy of TBONN with 2 biases (97.1%) is 5% higher than that of TBONN with 0 biases (92.1%). Further, we introduce a novel ODQN algorithm to design and train the TBONN and to accelerate computation in path planning tasks. To demonstrate the effectiveness of the algorithm, the TBONN is applied to 2D and 3D grid world tasks, and the differences between the minimum of avg100steps and the ideal value are both less than 1 in these two tasks. The calculated results demonstrate that our proposed algorithm is competitive with the conventional DQN algorithm based on ANNs, but accelerates the computation speed by 2.5 times and 4.5 times for the 2D and 3D grid worlds, respectively. Moreover, we also demonstrate that the TBONN with one UML can achieve results similar to those of the TBONN with a complete layer for path planning tasks. Finally, we analyze the noise errors of the TBONN and the ODQN algorithm, and show that the effect of noise on network performance can be eliminated by training, demonstrating the good robustness of the TBONN and ODQN algorithm. Notably, our proposed TBONN combined with ODQN shows excellent potential for deep reinforcement learning tasks.

Funding

National Natural Science Foundation of China (62171055, 62135009); Fundamental Research Funds for the Central Universities (ZDYY202102); Fund of State Key Laboratory of Information Photonics and Optical Communications (BUPT) (IPOC2020ZT08, IPOC2020ZT03); BUPT Innovation and Entrepreneurship Support Program (2021-YC-A319).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

2. F. Fan, J. Xiong, M. Li, et al., “On interpretability of artificial neural networks: A survey,” IEEE Trans. Radiat. Plasma Med. Sci. 5(6), 741–760 (2021). [CrossRef]  

3. M. Sheykhmousa, M. Mahdianpari, H. Ghanbari, et al., “Support vector machine versus random forest for remote sensing image classification: A meta-analysis and systematic review,” IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing 13, 6308–6325 (2020). [CrossRef]  

4. G. Hinton, L. Deng, D. Yu, et al., “Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups,” IEEE Signal Process. Mag. 29(6), 82–97 (2012). [CrossRef]  

5. T. Ferreira de Lima, B. J. Shastri, A. N. Tait, et al., “Progress in neuromorphic photonics,” Nanophotonics 6(3), 577–599 (2017). [CrossRef]  

6. Y. Shen, N. C. Harris, S. Skirlo, et al., “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017). [CrossRef]  

7. X. Xu, M. Tan, B. Corcoran, et al., “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature 589(7840), 44–51 (2021). [CrossRef]  

8. X. Lin, Y. Rivenson, N. T. Yardimci, et al., “All-optical machine learning using diffractive deep neural networks,” Science 361(6406), 1004–1008 (2018). [CrossRef]  

9. T. Yan, J. Wu, T. Zhou, et al., “Fourier-space diffractive deep neural network,” Phys. Rev. Lett. 123(2), 023901 (2019). [CrossRef]  

10. T. Zhang, J. Wang, Y. Dan, et al., “Efficient training and design of photonic neural network through neuroevolution,” Opt. Express 27(26), 37150–37163 (2019). [CrossRef]  

11. H. Bagherian, S. Skirlo, Y. Shen, et al., “On-chip optical convolutional neural networks,” arXiv, arXiv:1808.03303 (2018). [CrossRef]  

12. J. Gu, C. Feng, Z. Zhao, et al., “Efficient on-chip learning for optical neural networks through power-aware sparse zeroth-order optimization,” in Conference on Artificial Intelligence (AAAI, 2021), pp. 7583–7591.

13. Z. Fan, J. Lin, J. Dai, et al., “Photonic Hopfield neural network for the Ising problem,” Opt. Express 31(13), 21340–21350 (2023). [CrossRef]  

14. Y. Tian, Y. Zhao, S. Liu, et al., “Scalable and compact photonic neural chip with low learning-capability-loss,” Nanophotonics 11(2), 329–344 (2022). [CrossRef]  

15. H. Deng and M. Khajavikhan, “Parity–time symmetric optical neural networks,” Optica 8(10), 1328–1333 (2021). [CrossRef]  

16. T. W. Hughes, M. Minkov, Y. Shi, et al., “Training of photonic neural networks through in situ backpropagation and gradient measurement,” Optica 5(7), 864–871 (2018). [CrossRef]  

17. Y. Feng, J. Niu, Y. Zhang, et al., “Optical Neural Networks for Holographic Image Recognition,” Prog. Electromagn. Res. 176, 25–33 (2023). [CrossRef]  

18. Y. Zhu, G. L. Zhang, B. Li, et al., “Countering variations and thermal effects for accurate optical neural networks,” in ACM International Conference on Computer Aided Design (IEEE, 2020), pp. 1–7.

19. F. Flamini, A. Hamann, S. Jerbi, et al., “Photonic architecture for reinforcement learning,” New J. Phys. 22(4), 045002 (2020). [CrossRef]  

20. L. Liu, S. Lu, R. Zhong, et al., “Computing systems for autonomous driving: State of the art and challenges,” IEEE Internet Things J. 8(8), 6469–6486 (2021). [CrossRef]  

21. H. Nguyen and H. La, “Review of deep reinforcement learning for robot manipulation,” in Third International Conference on Robotic Computing (IEEE, 2019), pp. 590–595.

22. C. L. Lawson and R. J. Hanson, Solving Least Squares Problems (SIAM, 1995).

23. F. Shokraneh, M. S. Nezami, and O. Liboiron-Ladouceur, “Theoretical and experimental analysis of a 4× 4 reconfigurable MZI-based linear optical processor,” J. Lightwave Technol. 38(6), 1258–1267 (2020). [CrossRef]  

24. W. R. Clements, P. C. Humphreys, B. J. Metcalf, et al., “Optimal design for universal multiport interferometers,” Optica 3(12), 1460–1465 (2016). [CrossRef]  

25. C. Ding, S. Liao, Y. Wang, et al., “Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices,” in 50th Annual IEEE/ACM International Symposium on Microarchitecture (IEEE, 2017), pp. 395–408.

26. M. Xi, J. Yang, J. Wen, et al., “Comprehensive ocean information-enabled AUV path planning via reinforcement learning,” IEEE Internet Things J. 9(18), 17440–17451 (2022). [CrossRef]  

27. O. I. Abiodun, A. Jantan, A. E. Omolara, et al., “State-of-the-art in artificial neural network applications: A survey,” Heliyon 4(11), e00938 (2018). [CrossRef]  

28. I. A. Williamson, T. W. Hughes, M. Minkov, et al., “Reprogrammable electro-optic nonlinear activation functions for optical neural networks,” IEEE J. Select. Topics Quantum Electron. 26(1), 1–12 (2020). [CrossRef]  

29. A. Autere, H. Jussila, Y. Dai, et al., “Nonlinear optics with 2D layered materials,” Adv. Mater. 30(24), 1705963 (2018). [CrossRef]  

30. L. Vivien, A. Polzer, D. Marris-Morini, et al., “Zero-bias 40Gbit/s germanium waveguide photodetector on silicon,” Opt. Express 20(2), 1096–1101 (2012). [CrossRef]  

31. B. Bartlett, “Neuroptica: An optical neural network simulator” (2019), https://github.com/fancompute/neuroptica.

32. C. Yi and M. Qi, “Research on virtual path planning based on improved DQN,” in International Conference on Real-time Computing and Robotics (IEEE, 2020), pp. 387–392.

33. T. T. Mac, C. Copot, D. T. Tran, et al., “A hierarchical global path planning approach for mobile robots based on multi-objective particle swarm optimization,” Applied Soft Computing 59, 68–76 (2017). [CrossRef]  

34. C. Henkel, A. Bubeck, and W. Xu, “Energy efficient dynamic window approach for local path planning in mobile service robotics,” IFAC-PapersOnLine 49(15), 32–37 (2016). [CrossRef]  

35. R. Shao, G. Zhang, and X. Gong, “Generalized robust training scheme using genetic algorithm for optical neural networks with imprecise components,” Photonics Res. 10(8), 1868–1876 (2022). [CrossRef]  
