Realization of an integrated coherent photonic platform for scalable matrix operations

Sadra Rahimi Kari; Nicholas A. Nobile; Dominique Pantin; Vivswan Shah; Nathan Youngblood

doi:10.1364/OPTICA.507525

1. INTRODUCTION

The growing demand for artificial intelligence (AI) applications and cloud services has highlighted the need for more efficient computing systems to handle increasingly complex workloads. For example, a recent study found that the compute needed to run inference on top-performing deep neural networks is doubling at twice the rate of advances in hardware efficiency [1]. As AI continues to expand its capabilities and find widespread applications in areas such as healthcare [2] and cloud services (e.g., ChatGPT [3], Midjourney AI [4], DALL·E [5]), the limited efficiency of conventional computing hardware necessitates increased investment in hardware parallelism, an increasingly unsustainable solution [6]. Thus, there is an outstanding challenge to increase both computational throughput and efficiency to meet the growing global demand for computation [7,8].

Optical computing in the analog domain has shown promise to boost efficiency and throughput for AI applications. The majority of these approaches seek to improve efficiency and speed by employing weight-stationary techniques (i.e., static weights commonly encoded on-chip using tunable microring resonators [9,10], phase-change materials [11–13], or Mach–Zehnder interferometers (MZIs) [14–16]) and dynamic optical inputs to perform matrix-vector operations [17]. While this approach reduces data movement by maximizing reuse of trained weights, it encounters limitations in terms of scalability. The significant disparity in physical size between photonic and electronic components poses a challenge for weight-stationary architectures, often requiring continuous reconfiguration of the weight matrix during computation [18–20], which can result in efficiencies much worse than those of conventional digital hardware.

To address this physical scaling challenge, temporally encoded output-stationary approaches have emerged as a promising alternative [21], where both inputs and weights are streamed in time using either coherent [22] or incoherent [23] optical signals. In these time-based approaches, the number of photonic elements no longer places an upper bound on the maximum dimension of the matrix operation, drastically improving both scalability and efficiency at the cost of increased latency. While these temporal approaches are highly promising, a scalable on-chip solution leveraging a fully coherent, output-stationary architecture that can process both signed and complex-valued inputs has yet to be demonstrated. Our integrated approach to coherent, time-multiplexed optical computing leveraging single-mode silicon waveguides addresses the practical challenges of prior theoretical [21] and experimental [22] works, which use free-space components. These free-space architectures require both precise and stable control over the multiple optical path-lengths in the system to ensure temporal coherence, as well as custom diffractive optical elements to ensure spatial coherence. Additionally, unlike prior weight-stationary coherent architectures such as [24] that allow for computation in the complex plane, our time-multiplexed approach allows for flexible weight updates and large-dimensional vector operations, which are not constrained by the physical size of the photonic hardware.

Fig. 1. Design of an integrated dot-product unit cell (DPUC). (a) Illustration of multiplication between two optical inputs through interference on a 50:50 beam splitter and homodyne detection. Temporal integration of the resulting homodyne signal provides accumulation necessary to calculate the dot-product between two time-multiplexed optical inputs. (b) Illustration of processing the correlation between two stochastic bit streams using an on-chip directional coupler and homodyne detection. Temporal integration and averaging computes the uncentered covariance between the two bit streams in real time. (c) Schematic and (d) microscope image of the coherent DPUC, providing an overview of the fundamental building blocks and the operation of the device. Thermal phase shifters are used to control the amplitude and phase of the two optical inputs.

In this paper, we experimentally demonstrate and characterize a coherent dot-product unit cell (DPUC) as well as a $4 \times 4$ coherent crossbar array, two of the core components necessary for our recently proposed computing platform for large-scale matrix-matrix multiplication [20]. By controlling both the amplitude and phase of our optical inputs, we demonstrate data encoding in both the real and complex planes with up to 5 bits of resolution. We then use these results to compute the real-valued dot-product between two 64-element input vectors as well as element-wise multiplication between complex numbers. Next, we demonstrate correlation detection on two stochastic bit streams with an accuracy of 99% using our DPUC. Finally, we experimentally demonstrate a pathway to future scaling to two dimensions through free-space coupling and measurement with a near-infrared image sensor.

2. DEVICE DESIGN AND WORKING PRINCIPLE

The dot-product between two time-varying optical inputs can be computed using a 50:50 beam splitter and homodyne detection as previously proposed by Hamerly et al. [21] [illustrated in Fig. 1(a)]. Assuming the two inputs are both spatially and temporally coherent, meaning they are phase-matched, single-mode, and possess the same polarization, the detected homodyne signal can be written [20]:

(1)$${P}_{+}\!\left(t\right)-{P}_{-}\!\left(t\right)=2 \cdot \mathrm{R}\mathrm{e}\!\left[{E}_{x}^{\mathrm{*}}\!\left(t\right){E}_{y}\!\left(t\right)\right]{\sin}\!\left(\Delta {\varphi}_{\textit{xy}}\!\left(t\right)\right),$$

where ${E}_{x}(t)$ and ${E}_{y}(t)$ are the E-field amplitudes of input signals $x$ and $y$, and $\Delta {\varphi}_{\textit{xy}}(t)$ is the relative phase difference between the two inputs. The dot-product between the two input vectors is then directly proportional to the time-integrated differential photocurrent obtained from balanced detection:

(2)$${I}_{\text{PC}}=\frac{2}{\tau}\cdot \frac{\eta e}{hv}{\int}_{0}^{{N\tau}}{E}_{x}\!\left(t\right){E}_{y}\!\left(t\right){\sin}\!\left(\Delta {\varphi}_{\textit{xy}}\!\left(t\right)\right){\rm d}t \propto \mathop {x}\limits^\rightharpoonup\cdot \mathop {y}\limits^\rightharpoonup=\sum _{i=1}^{N}{x}_{i}{y}_{i},$$

where ${I}_{\text{PC}}$ is the time-integrated photocurrent, $N$ is the number of pulses (i.e., the length of the input vectors), $\tau$ is the pulse period, and $R=\frac{\eta e}{hv}$ is the responsivity of the photodetectors with quantum efficiency $\eta$, electron charge $e$, and photon energy $hv$ [20,25]. If temporal averaging is used in place of integration, the time-averaged photocurrent will be proportional to the uncentered covariance (${\rm cov}_{\textit{XY}}$) between the two input signals:

(3)$$\begin{split}\langle{I}_{\text{PC}}\rangle & = \frac{2}{N\tau}\cdot \frac{\eta e}{hv}{\int}_{0}^{{N\tau}}{E}_{x}\!\left(t\right){E}_{y}\!\left(t\right){\sin}\!\left(\Delta {\varphi}_{\textit{xy}}\!\left(t\right)\right){\rm d}t\propto {\rm cov}_{\textit{XY}}\\ & = \frac{1}{N}\sum _{i=1}^{N}{x}_{i}{y}_{i}.\end{split}$$

For the special case where the expectation values of the two input signals are zero (i.e., ${\mu}_{X},{\mu}_{Y}=0$), the covariance is centered and related to the correlation by

(4)$${\rho}_{\textit{XY}}=\frac{{\rm cov}_{\textit{XY}}}{{\sigma}_{X}{\sigma}_{Y}},\,\mathrm{where}\,{\sigma}_{A}=\sqrt{\frac{1}{N}\sum _{i=1}^{N}{a}_{i}^{2}},$$

where ${\rho}_{\textit{XY}}$ is the correlation and ${\sigma}_{X},{\sigma}_{Y}$ are the standard deviations of the two input signals. Because this approach uses homodyne detection to compute the product of the field amplitudes, common mode noise present in signals ${E}_{x}(t)$ and ${E}_{y}(t)$ is canceled, and optical noise in the system can reach the quantum limit [26]. Additionally, temporal integration reduces the minimum optical power required to achieve a given bit precision by a factor of $1/N$, resulting in operations that can achieve less than one photon per multiply-accumulate operation [23,27].
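As an illustrative numerical check of Eqs. (1)–(3), the following sketch models an ideal, lossless 3 dB coupler followed by balanced detection. It is not the code used in this work: the coupler matrix, the $\pi/2$ bias applied to input $y$ (chosen so that $\sin(\Delta\varphi)=\pm 1$ for signed real inputs), the unit responsivity, and all function names are assumptions made for illustration.

```python
import numpy as np

def homodyne_difference(e_x, e_y):
    """Differential power at the two outputs of an ideal 3 dB coupler."""
    out_1 = (e_x + 1j * e_y) / np.sqrt(2)
    out_2 = (1j * e_x + e_y) / np.sqrt(2)
    # Equals 2*|E_x|*|E_y|*sin(phi_y - phi_x) for this coupler model.
    return np.abs(out_2) ** 2 - np.abs(out_1) ** 2

def optical_dot(x, y):
    """Dot product of two signed real vectors from the summed homodyne signal."""
    e_x = np.asarray(x, dtype=float).astype(complex)  # sign = 0 or pi phase
    e_y = 1j * np.asarray(y, dtype=float)             # assumed pi/2 bias
    return 0.5 * np.sum(homodyne_difference(e_x, e_y))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 64)
y = rng.uniform(-1, 1, 64)

dot_est = optical_dot(x, y)   # summation over N pulses, as in Eq. (2)
cov_est = dot_est / x.size    # averaging instead of summing, as in Eq. (3)
```

In this idealized model the summed differential signal reproduces `np.dot(x, y)` exactly; noise, loss, and finite extinction ratio are deliberately left out.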

Fig. 2. Characterization and experimental resolution of integrated DPUC for both real and complex data encodings. (a) Transmission of MZI modulators versus applied current. The blue and red dots correspond to the measured transmission of each MZI, while the blue and red dashed lines correspond to Eq. (5). (b) Measured phase versus current of both phase shifters used to control the relative phase difference between the two input signals. (c), (d) Amplitude modulation measured at test ports of both MZIs showing linear control of the amplitude. Both MZIs suffer from small deviations at small amplitudes due to fabrication imperfections but exhibit high accuracy for amplitudes $\gt 0.1$. (e) Demonstration of 3-bit amplitude and 6-bit phase control of the optical signal mapped to the complex plane.

The same operation can be performed on-chip by replacing the 50:50 beam splitter with a 3 dB directional coupler as shown in Fig. 1(b). The amplitude and phase of the two inputs are controlled using two thermo-optic MZI modulators and phase shifters, respectively, as illustrated in Fig. 1(c). To ensure spatial coherence, single-mode TE waveguides are used. Temporal coherence is also ensured by using a single optical source, which is then split on-chip using a y-splitter and sent to both modulator branches. To facilitate calibration, two test ports are included to monitor the output of each MZI. An optical image of the final device fabricated by Applied Nanotools on 220 nm SOI wafers is shown in Fig. 1(d).

An 8-channel current source with 16-bit resolution per channel (UEI DAQ AO-308-020) was used to modulate the amplitude and phase of both input signals. To account for fabrication differences between the four thermo-optic phase shifters, each amplitude/phase modulator pair was characterized and fit to the following empirical model [28]:

(5)$${P}_{\text{MZI}}\!\left({I}_{A}\right)=\frac{1}{2}+\frac{1}{2}{\cos}\!\left(A{{I}_{A}}^{2}+B{I}_{A}+{\varphi}_{A}\right),$$

(6)$${\varphi}_{\text{Phase Shifter}}\!\left({I}_{Q}\right)=D{{I}_{Q}}^{2}+E{I}_{Q}+{\varphi}_{Q}.$$

Here, ${I}_{A}$ and ${I}_{Q}$ denote the currents applied to the amplitude and phase modulators, respectively, while $A$, $B$, $D$, and $E$ represent the coefficients acquired through numerical fitting. ${\varphi}_{A}$ and ${\varphi}_{Q}$ represent the fixed phase offset between the two arms within the MZIs as well as their relative phase difference after fabrication. Figure 2(a) shows the normalized transmission as a function of applied current for the two MZIs measured at test ports 1 and 2, while the current-phase relationship from the thermo-optic phase shifters is shown in Fig. 2(b). To characterize the phase shifters after the MZIs, the output of each MZI was maximized, and the interference at the 3 dB coupler was measured as a function of current applied to either phase shifter.
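One way to use the fitted model of Eq. (5) is to invert it, i.e., to find the drive current that yields a target field amplitude (the square root of the normalized transmission). The sketch below does this numerically. The coefficient values are hypothetical placeholders rather than fitted values from the device, and `current_for_amplitude` is an illustrative helper, not part of any instrument API.

```python
import numpy as np

# Hypothetical fit coefficients (rad/mA^2, rad/mA, rad); not measured values.
A, B, PHI_A = 0.02, 0.05, 0.3

def mzi_transmission(current, a=A, b=B, phi=PHI_A):
    """Normalized MZI transmission versus drive current, Eq. (5)."""
    return 0.5 + 0.5 * np.cos(a * current**2 + b * current + phi)

def current_for_amplitude(amplitude, a=A, b=B, phi=PHI_A):
    """Smallest non-negative current giving the requested field amplitude."""
    # Transmission = amplitude^2, so cos(theta) = 2*amplitude^2 - 1.
    theta = np.arccos(np.clip(2.0 * amplitude**2 - 1.0, -1.0, 1.0))
    candidates = []
    for k in range(-2, 3):
        for target in (theta + 2 * np.pi * k, -theta + 2 * np.pi * k):
            # Solve a*I^2 + b*I + phi = target for I.
            for root in np.roots([a, b, phi - target]):
                if abs(root.imag) < 1e-9 and root.real >= 0:
                    candidates.append(root.real)
    return min(candidates)

levels = np.linspace(0.0, 1.0, 32)  # 32 amplitude levels (5-bit)
currents = np.array([current_for_amplitude(v) for v in levels])
```

A lookup table of this kind would normally be built once per modulator after calibration, since each device has its own coefficients.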

A. Real and Complex Number Representation

One notable advantage of using coherent signaling for optical computing lies in the ability to encode both negative and positive numbers for both the inputs and the weights. With our coherent DPUC, negative numbers are encoded by applying a $\pi$ phase shift via independent phase shifters positioned after the MZIs. Thus, for an MZI modulator capable of $n$-bit amplitude resolution, this approach allows us to encode up to ${2}^{n+1}$ signed numbers within the range of $[-1,1]$. Figures 2(c) and 2(d) show the amplitude of the transmitted light for both MZIs measured at the output test ports for 32 current levels (5-bit resolution). We observe that both modulators show reduced accuracy for small amplitudes $\lt 0.1$. This is likely due to imperfect splitting at the $2\times 2$ output coupler of each MZI (note that MZI 1 has better performance at low amplitudes than MZI 2, likely indicating fabrication variations between the two output couplers). However, for amplitudes $\gt 0.1$, the amplitude of both modulators agrees well with the ideal encoded amplitude.
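The signed encoding described above can be sketched as a mapping from a real value to an amplitude level and a phase of $0$ or $\pi$. The uniform spacing of the amplitude levels and the helper names below are assumptions for illustration only.

```python
import numpy as np

def encode_signed(value, n_bits):
    """Map values in [-1, 1] to (amplitude, phase) with n-bit amplitude levels."""
    value = np.asarray(value, dtype=float)
    steps = 2**n_bits - 1                      # assumed uniform levels 0 ... 1
    amplitude = np.round(np.abs(value) * steps) / steps
    phase = np.where(value < 0, np.pi, 0.0)    # pi phase shift encodes the sign
    return amplitude, phase

def decode_signed(amplitude, phase):
    """Recover the signed real value represented by (amplitude, phase)."""
    return amplitude * np.cos(phase)

values = np.linspace(-1.0, 1.0, 41)
amplitude, phase = encode_signed(values, n_bits=3)
decoded = decode_signed(amplitude, phase)
```

With uniform levels, the worst-case quantization error of this mapping is half of one amplitude step.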

If we extend the precision of phase control beyond simply 0 and $\pi$, we can encode arbitrary numbers within the unit circle in the complex plane [Fig. 2(e)]. For this measurement, the output of MZI 2 was fixed at a constant amplitude of 1 and phase of 0, while modulating both the amplitude and phase of MZI 1. Extracting the complex number encoded by MZI 1 is achieved by measuring the in-phase and quadrature components after the 3 dB coupler [29]. While a 90-deg optical hybrid can be used to recover the in-phase and quadrature components simultaneously [30], we use a two-step process to recover the complex value in our DPUC. In the first step, the real (cosine) component is measured to extract the magnitude of the complex number. In the second step, a $\frac{\pi}{2}$ phase shift is applied to one of the inputs to the 3 dB coupler to measure the imaginary (sine) component as shown in the following equations:

(7)$$\tilde{z}=a+ib=\left|z\right|{\cos}\!\left({\varphi}_{z}\right)+i\left|z\right|{\sin}\!\left({\varphi}_{z}\right)=\left|z\right|{e}^{i{\varphi}_{z}},$$

(8)$$a=\left|z\right|{\cos}\!\left({\varphi}_{z}\right)=\left|z\right|{\sin}\!\left({\varphi}_{z}+\frac{\pi}{2}\right),b=\left|z\right|{\sin}\!\left({\varphi}_{z}\right),$$

(9)$$\left|z\right|=\sqrt{{a}^{2}+{b}^{2}},\quad {\varphi}_{z}=\mathrm{atan}\!\left(\frac{b}{a}\right),$$

where $|z|$ and ${\varphi}_{z}$ are the magnitude and phase of the complex number $\tilde{z}$. Complex number encoding increases the maximum bit precision of our system by a factor of $2\times $ (e.g., from 3 bits to 6 bits) for a given signal-to-noise ratio. This also increases the dimensionality of the input data and trained weights, which could have potential benefits for the scalability and generality of photonic AI processing [24].
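The two-step recovery of Eqs. (7)–(9) can be illustrated with an idealized model in which a unit-amplitude reference interferes with the encoded signal at a lossless 3 dB coupler. The coupler matrix and function names are assumptions of this sketch. It also uses `np.arctan2` rather than $\mathrm{atan}(b/a)$ so that the phase is resolved over all four quadrants, which is a choice made here and not necessarily the procedure used in the experiment.

```python
import numpy as np

def coupler_difference(e_1, e_2):
    """Differential output power of an ideal 3 dB coupler: 2*Im(conj(e_1)*e_2)."""
    out_1 = (e_1 + 1j * e_2) / np.sqrt(2)
    out_2 = (1j * e_1 + e_2) / np.sqrt(2)
    return np.abs(out_2) ** 2 - np.abs(out_1) ** 2

def recover_complex(z):
    """Two sequential homodyne measurements with a pi/2 relative phase shift."""
    ref = 1.0 + 0.0j
    # Measurement without extra shift: |z| sin(phi_z), the imaginary part b.
    b = 0.5 * coupler_difference(ref, z)
    # Measurement with a pi/2 relative shift: |z| sin(phi_z + pi/2) = a.
    a = 0.5 * coupler_difference(ref * np.exp(-1j * np.pi / 2), z)
    return np.hypot(a, b), np.arctan2(b, a)

z_true = 0.75 * np.exp(1j * 2.5)
magnitude, phase = recover_complex(z_true)
```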

3. RESULTS

A. Analog Photonic Multiplication in Real and Complex Spaces

Using our DPUC, we demonstrate sequential multiplication between both real- and complex-valued inputs, which can then be summed through temporal integration. To demonstrate this capability for real numbers, a multiplication sweep was performed by systematically varying the two inputs from −1 to ${+}1$ and measuring their product through homodyne detection. The resulting outputs are presented in Fig. 3(a) for the multiplication between two 4-bit real numbers (i.e., 3 bits of resolution in amplitude and 1 bit in phase). By analyzing the error between expected and measured input amplitudes [denoted by the length of the black lines in Fig. 3(a)] using the two MZI test ports, we calculate the effective bit precision of Input A and Input B to be 4.6 and 3.7, respectively, using the formula given in Eq. (12). Figure 3(a) shows that achieving inputs close to 0 can be challenging because inputs are encoded in the amplitude of the electric fields rather than the intensity of the optical signal. Thus, for zero-valued inputs, an intensity error of 1% (an MZI extinction ratio of ${-}20\; {\rm dB}$) corresponds to a 10% error in the E-field amplitude. Figure 3(b) plots the accuracy of the measured product between inputs $A$ and $B$ from the data shown in Fig. 3(a) with a standard deviation of 0.033 (corresponding to a precision of $\sim 3.9$ bits [31]). Near zero, we again observe a larger error in the measured product since we are limited by the accuracy of our MZIs when encoding values $\lt 0.1$ [see Figs. 2(c) and 2(d)]. Increasing the extinction ratio of our MZIs by improving the fabrication tolerance of the 3 dB couplers and ensuring equal loss in both waveguides is expected to improve the encoded precision at small values [32].

Fig. 3. DPUC experimental results for real number multiplications and dot-product operations. (a) 2D contour map of the product between two input signals. The highlighted points (white) indicate the measured input values, while the color of each contour region encodes the corresponding multiplication result. Black dots and solid black lines indicate the deviation of measured input values relative to their ideal position. (b) Measured (red dots) versus ideal (dashed line) amplitude of the multiplication product from (a) demonstrating a measured precision of 3.9 bits. (c) Two randomly generated input vectors with 64 elements and 4-bit resolution. The complements of the two measured test port signal intensities (proportional to the square of the inputs) are shown in red and blue, while the black dashed line represents the expected amplitude for each input. (d) Differential output signal (multiplication between signals A and B) as recorded via the two output ports of the 3 dB coupler (blue). (e) Post-processed time-integrated output signal representing the dot-product between the two input signals in (c). The observed results follow the ideal integrated signal (dashed line), with an RMSE of 0.09 (an effective precision of 3.8 bits).

For dot-product operations, two random analog streams (4-bit signed numbers) were generated at a clock rate of 1 kHz, each with a length of 64 samples. The input signal intensities for $A$ and $B$ measured using the complementary signal from the two test ports are shown in Fig. 3(c) with their respective ideal values (black dashed lines). The real-time output of the DPUC is presented in Fig. 3(d) together with the ideal homodyne signal (black dashed line). We observe good agreement between the measured and ideal signals apart from certain values close to zero where we are limited by the extinction ratio of the MZIs. Figure 3(e) shows the time-integrated signal after data acquisition, with the final result corresponding to the dot-product of the two 64-element input vectors. The achieved results agree well with the ideal dot-product, with a root mean square error (RMSE) of 0.09. This corresponds to an effective precision of 3.8 bits, in good agreement with the results of our scalar multiplication measurements shown in Fig. 3(b). We note that small errors in multiplication can have a cumulative effect on the final output of the dot-product. For lower precision measurements where the precise levels are more accurately defined (e.g., covariance between two stochastic bit streams), we observe much better accuracies as shown in our correlation results.

The same approach can be extended to the complex plane for both multiplication and dot-product operations by encoding the vectors $\mathop {x}\limits^\rightharpoonup$ and $\mathop {y}\limits^\rightharpoonup$ in the time-varying amplitudes and phases of two interfering electric fields ${\mathop {\boldsymbol E}\limits^\rightharpoonup}_{x}(t)=|{E}_{x}(t)|{e}^{i{\varphi}_{x}(t)}$ and ${\mathop {\boldsymbol E}\limits^\rightharpoonup}_{y}(t)=|{E}_{y}(t)|{e}^{i{\varphi}_{y}(t)}$. When incident on a 90-deg optical hybrid (or by using two consecutive measurements with a $\pi /2$ phase shift and 3 dB coupler), the difference in photocurrent integrated over time from Eq. (2) is modified to the following:

(10)$$\begin{split} {I}_{\text{PC}} & = \frac{2}{\tau}\cdot \frac{\eta e}{hv}\!\left[{\int}_{0}^{{N\tau}}\left|{E}_{x}\!\left(t\right)\right|\left|{E}_{y}\!\left(t\right)\right|{\cos}\!\left(\Delta {\varphi}_{\textit{xy}}\!\left(t\right)\right){\rm d}t\right. \\ & \left.\quad +\;i{\int}_{0}^{{N\tau}}\left|{E}_{x}\!\left(t\right)\right|\left|{E}_{y}\!\left(t\right)\right|{\sin}\!\left(\Delta {\varphi}_{\textit{xy}}\!\left(t\right)\right){\rm d}t\right], \\ {I}_{\text{PC}} & \propto \mathop {x}\limits^\rightharpoonup\cdot \mathop {y}\limits^\rightharpoonup=\sum _{i=1}^{N}\left|{x}_{i}\right|\left|{y}_{i}\right|{e}^{i\!\left({\varphi}_{x}+{\varphi}_{y}\right)}.\end{split}$$

By fully leveraging the optical phase and amplitude, MAC operations can be performed on complex numbers in either one or two clock cycles, depending on the measurement scheme. In the digital domain, by contrast, complex numbers are stored separately as real and imaginary components (i.e., $\tilde {z}=x+iy$), so the computation $\mathop {x}\limits^\rightharpoonup\cdot \mathop {y}\limits^\rightharpoonup$ requires four separate dot-products:

(11)$$\begin{split}\mathop {x}\limits^\rightharpoonup\cdot \mathop {y}\limits^\rightharpoonup & = \left[\mathrm{R}\mathrm{e}\!\left(\mathop {x}\limits^\rightharpoonup\right)\cdot \mathrm{R}\mathrm{e}\!\left(\mathop {y}\limits^\rightharpoonup\right)-\mathrm{I}\mathrm{m}\!\left(\mathop {x}\limits^\rightharpoonup\right)\cdot \mathrm{I}\mathrm{m}\!\left(\mathop {y}\limits^\rightharpoonup\right)\right]\\ &\quad +i\left [\mathrm{R}\mathrm{e}\!\left(\mathop {x}\limits^\rightharpoonup\right)\cdot \mathrm{I}\mathrm{m}\!\left(\mathop {y}\limits^\rightharpoonup\right)+\mathrm{I}\mathrm{m}\!\left(\mathop {x}\limits^\rightharpoonup\right)\cdot \mathrm{R}\mathrm{e}\!\left(\mathop {y}\limits^\rightharpoonup\right)\right].\end{split}$$

Thus, for each complex MAC operation, we are performing the equivalent of 8 digital operations.
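The decomposition in Eq. (11) can be checked numerically with a short sketch. The random test vectors and the function name are illustrative assumptions. One way to arrive at the count of 8 operations is that each complex MAC costs four real multiplications plus four real additions (two to combine the products and two to accumulate the real and imaginary sums).

```python
import numpy as np

def complex_dot_via_real(x, y):
    """Unconjugated complex dot product built from four real dot products."""
    xr, xi = np.real(x), np.imag(x)
    yr, yi = np.real(y), np.imag(y)
    real_part = np.dot(xr, yr) - np.dot(xi, yi)
    imag_part = np.dot(xr, yi) + np.dot(xi, yr)
    return real_part + 1j * imag_part

rng = np.random.default_rng(0)
n = 64
x = rng.uniform(0, 1, n) * np.exp(1j * rng.uniform(0, 2 * np.pi, n))
y = rng.uniform(0, 1, n) * np.exp(1j * rng.uniform(0, 2 * np.pi, n))

result = complex_dot_via_real(x, y)
direct = np.sum(x * y)  # same quantity computed with native complex arithmetic
```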

To showcase the potential of the DPUC for performing complex operations, we conducted a similar experiment as described above for real-number multiplication. Two random complex streams, each comprising 64 samples, were generated with 3 bits of resolution for amplitude and 6 bits of resolution for phase [Fig. 4(a)]. The real-time output of the DPUC, representing the real and imaginary components of the complex product between signals $A$ and $B$, is shown in Figs. 4(b) and 4(c) using two sequential homodyne measurements with a $\pi /2$ relative phase shift applied between measurements as described above. Using the real and imaginary components of the product in Figs. 4(b) and 4(c), we can calculate the magnitude and phase of the complex products and plot them against their ideal values [Figs. 4(d) and 4(e)]. The error in the measured magnitude and phase is shown in the insets of Figs. 4(d) and 4(e). By calculating the RMSE of these results, we estimate the precision of our complex multiplication to be 2.8 bits for the magnitude of the product and 3.5 bits for the phase using the following equation:

(12)$$\text{bit precision} ={{\log}}_{2}\!\left(\frac{{\max}\!\left({\text{output}}\right)-{\min}\!\left({\rm output}\right)}{\mathrm{R}\mathrm{M}\mathrm{S}\mathrm{E}\times k}\right),$$

where the range is defined by the difference between the maximum and minimum output values (either amplitude or phase), $\mathrm{R}\mathrm{M}\mathrm{S}\mathrm{E}$ is the root mean square error, and $k$ is a scaling factor that depends on the distribution of ideal output values (for uniformly distributed values, $k=\sqrt{12}$) [33].
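For reference, Eq. (12) can be written as a short function. The sketch below evaluates it on synthetic data with a known error amplitude. Taking the range from the ideal output values is an assumption of this sketch, and the data are not measurements from the device.

```python
import numpy as np

def bit_precision(ideal, measured, k=np.sqrt(12)):
    """Effective bit precision following Eq. (12)."""
    ideal = np.asarray(ideal, dtype=float)
    measured = np.asarray(measured, dtype=float)
    rmse = np.sqrt(np.mean((measured - ideal) ** 2))
    output_range = ideal.max() - ideal.min()  # assumed: range of ideal outputs
    return np.log2(output_range / (rmse * k))

# Synthetic example: outputs spanning [-1, 1] with a fixed error of +/-0.05.
ideal = np.linspace(-1.0, 1.0, 64)
error = 0.05 * np.where(np.arange(64) % 2 == 0, 1.0, -1.0)
bits = bit_precision(ideal, ideal + error)  # roughly 3.5 bits for this case
```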

Fig. 4. Complex number multiplication using a DPUC. (a) Complex-valued input data are randomly generated with 3-bit resolution and 6-bit resolution for the amplitude and phase, respectively. The measured intensities of the two input signals are shown as red and blue, while the encoded phase is represented by the fill color. (b), (c) Differential output signals recorded for the (b) real (blue) and (c) imaginary (red) components of the complex product. The dashed lines correspond to the ideal multiplication according to the input values from (a). (d), (e) Accuracy of measured (d) magnitude and (e) phase for the complex products shown in (b) and (c) compared to the ideal complex components of the product (dashed line). Outliers in (e) are due to measurement error when multiplying two complex numbers where one has an amplitude of zero.

Fig. 5. Measuring covariance of random bit streams. (a) Random bit streams using phase encoding where binary “0” $\to \varphi =\pi$ and binary “1” $\to \varphi =0$. (b) Differential output homodyne signal (blue) compared to the ideal output expected (dashed line). When both signals are correlated, the differential output is ${+}1$, while for anticorrelated signals, the output is ${-}1$. (c) Comparison of the time-integrated differential signal (red) from (b) with the ideal output (dashed line). Dividing the final result by the bit stream length provides the covariance between signals X and Y. (d) Measured covariance versus the ideal covariance (dashed line) for multiple experiments with different degrees of correlation.

B. Correlation Detection between Stochastic Bit Streams

Real-time measurement of statistical correlations between event-based data streams is crucial for a variety of fields such as Internet of Things (IoT), networking, healthcare, and social sciences [34]. For example, in the case of IoT and networking, correlation detection can be used to quickly alert system administrators of an adversarial attack from network traffic patterns [35] or of a potential systems failure from anomalous events in IoT sensors [36,37]. Thus, there is a clear need to quickly identify correlations on dynamic data streams with low latency and high efficiency—especially for data already in the optical domain [38].

Here, we use our coherent architecture to demonstrate correlation detection between stochastic bit streams. For our experiment, we generated two random binary streams 64 bits in length using a phase modulation encoding scheme. Both MZIs were set to a constant value corresponding to their maximum transmission, and data were encoded using independent phase shifters. Thus, a binary value of “0” translates to an optical signal with a phase of $\varphi =\pi$, while a binary value of “1” is represented by an optical signal with a phase of $\varphi =0$. By integrating the differential output signal and dividing by the total integration time, we were able to calculate the covariance for the two input bit streams, which is directly proportional to the correlation between the bit streams according to Eq. (4).

Fig. 6. Scaling DPUCs to matrix-vector and matrix-matrix operations. (a) Example of a 1D array of DPUCs to implement the matrix-vector operation $\mathop {c}\limits^\rightharpoonup=\boldsymbol{B}\mathop {a}\limits^\rightharpoonup$. (b) Illustration of modified DPUC within a crossbar array. Directional couplers on the row and column bus waveguides enable compact fan-out in two dimensions. (c) A 2D array of DPUCs capable of performing the matrix-matrix multiplication $\boldsymbol{C}=\boldsymbol{A}\boldsymbol{B}$, where the rows of $\boldsymbol{A}$ and columns of $\boldsymbol{B}$ are time-multiplexed input signals routed to the rows and columns of the array. (d) Microscope image of a fabricated 2D array of DPUCs. Grating coupler pairs are used to couple the output power to an image sensor after the 3 dB coupler. (e) Captured image from the array in (d) using a near-IR image sensor. An input optical power of 10 µW is equally distributed to all row and column bus waveguides in (d) (fan-out network not shown). (f) Bar plot of optical power for each output grating coupler pair in (e) as measured by the near-IR image sensor. Neighboring pixels are grouped together to calculate the normalized optical power.

Figure 5(a) shows the two input signals generated from random bit streams where a phase of $\varphi =\pi$ corresponds to a real number value of ${-}1$ and a phase of $\varphi =0$ corresponds to ${+}1$. The normalized differential photocurrent measured at the output of the 3 dB coupler can be seen in Fig. 5(b). Because we maintain constant amplitude and modulate the phase between $0$ and $\pi$, our accuracy is much higher than the results shown in Figs. 3 and 4 since we are not limited by the extinction ratio of the two MZIs. This higher accuracy in the differential signal translates to an improved accuracy in the time-integrated output shown in Fig. 5(c), where our final result is within 3.6% of the ideal dot-product. Dividing the dot-product by the bit stream length ($N=64$), we find the covariance to be 0.278 compared to the ground truth of 0.25. We repeated this experiment for multiple randomly generated bit streams with a predefined correlation. The results are shown in Fig. 5(d), where we compare the measured covariance with the ideal covariance. For all experiments, we observe excellent agreement with an average RMSE of 0.0025.
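The phase-encoded covariance measurement can be mimicked with a noiseless numerical sketch. The bit streams below are synthetic (24 of 64 bits are deliberately flipped so that the ideal covariance is 0.25), and the ideal coupler model, the $\pi/2$ bias, and the function names are assumptions of the sketch rather than a description of the experimental setup.

```python
import numpy as np

def encode_bits(bits):
    """Phase encoding: binary 0 -> phase pi, binary 1 -> phase 0, unit amplitude."""
    phase = np.where(np.asarray(bits) == 1, 0.0, np.pi)
    return np.exp(1j * phase)

def homodyne_covariance(bits_x, bits_y):
    """Time-averaged normalized differential signal of an ideal 3 dB coupler."""
    e_x = encode_bits(bits_x)
    e_y = 1j * encode_bits(bits_y)  # assumed pi/2 bias so the output is +/-1
    out_1 = (e_x + 1j * e_y) / np.sqrt(2)
    out_2 = (1j * e_x + e_y) / np.sqrt(2)
    differential = 0.5 * (np.abs(out_2) ** 2 - np.abs(out_1) ** 2)
    return differential.mean()      # divide the integrated signal by N

rng = np.random.default_rng(0)
bits_x = rng.integers(0, 2, 64)
bits_y = bits_x.copy()
bits_y[:24] ^= 1                    # 40 matching bits, 24 flipped bits

cov = homodyne_covariance(bits_x, bits_y)  # (40 - 24) / 64 = 0.25
```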

C. Scalability

While our experiments have focused primarily on demonstrating a single unit cell capable of implementing a complex dot-product, it is important to consider scaling to 1D and 2D implementations. A linear array of such unit cells would enable matrix-vector multiplication (MVM), while a 2D array of these unit cells allows for general matrix-matrix multiplication (GEMM). An example of a 1D array of DPUCs capable of implementing MVM operations is illustrated in Fig. 6(a), where an input vector $\mathop {a}\limits^\rightharpoonup$ is multiplied with matrix $\boldsymbol{B}$, whose rows are encoded in vectors ${\mathop {b}\limits^\rightharpoonup}_{1}$ through ${\mathop {b}\limits^\rightharpoonup}_{4}$. Each DPUC implements the dot-product between $\mathop {a}\limits^\rightharpoonup$ and one of the rows of $\boldsymbol{B}$ resulting in the output vector $\mathop {c}\limits^\rightharpoonup$ with elements ${c}_{1}$ through ${c}_{4}$. To easily scale the DPUC to one- and two-dimensional circuits, we implement fan-out using a crossbar array structure, which we have previously proposed [20]. The revised DPUC is shown in Fig. 6(b) and includes two directional couplers as well as a waveguide crossing. This modifies Eq. (2) as follows:

(13)$${I}_{\text{PC}}=\frac{2{\kappa}_{i}{\kappa}_{j}}{\tau}\cdot \frac{\eta e}{hv}{\int}_{0}^{{N\tau}}{E}_{x}\!\left(t\right){E}_{y}\!\left(t\right){\sin}\!\left(\Delta {\varphi}_{\textit{xy}}\!\left(t\right)\right){\rm d}t,$$

where ${\kappa}_{i}$ and ${\kappa}_{j}$ are the cross-coupling coefficients of the two directional couplers. By tuning the coupling length of these directional couplers, optical inputs can be equally distributed to DPUCs within an array. However, a few key factors impact scalability and need to be addressed:

(1) Power distribution: Equal power distribution is important during the optical fan-out process for inputs to ensure that each individual DPUC within the crossbar array receives a similar amount of power. In a crossbar array of size $N \times N$, each unit cell receives $1/N$ of the power from each input modulator. As the size of the crossbar array increases, the minimum coupling ratio decreases. In practice, fabrication process variations limit how precisely the incremental lengths of the directional couplers, and hence these small coupling ratios, can be set. However, precision in the fan-out of the optical signals is not necessary since this is simply a scaling factor that can be adjusted in a single post-processing step during readout after calibration of the array (we analyze this in detail in [20]). It is more important that each 3 dB coupler within the DPUC achieves an equal splitting ratio for accurate homodyne detection.
(2) Optical loss: Each waveguide crossing, directional coupler, and bend induces additional optical attenuation. Cascading many elements would reduce the optical power and require higher input powers at the source to achieve the same signal-to-noise ratio. This is fundamentally limited by the maximum source power and two-photon absorption in the silicon waveguides. Switching platforms to silicon nitride can minimize two-photon absorption and propagation loss at the cost of a larger unit cell size.
(3) Phase noise: Small phase errors can accumulate across long optical paths, degrading interference and signal accuracy. However, for fixed optical paths with minimal phase variation due to thermal fluctuations, it is possible to trim each unit cell to maximize the interference contrast using low-loss phase-change materials (e.g., ${\rm Sb}_2{\rm Se}_3$ [39]) and a one-time post-processing step after fabrication of the crossbar array.

Addressing these challenges through advances in low-loss integrated photonics, precise trimming with phase-change materials, and careful calibration procedures, we anticipate that crossbar arrays with dimensions of $64 \times 64$ or greater can be realized. We note that prior experimental demonstrations have successfully used directional couplers to achieve large 2D arrays in integrated photonic LIDAR platforms [40–42].

A two-dimensional array of DPUCs that implements the GEMM operation $\boldsymbol{C}=\boldsymbol{A}\boldsymbol{B}$ is illustrated in Fig. 6(c) with a microscope image of a fabricated prototype shown in Fig. 6(d). To simplify read-out of multiple DPUCs in parallel, we include two grating couplers within each DPUC and detect their output using a near-IR objective and image sensor (Raptor Photonics OWL-640-T). An example near-IR image of the $4 \times 4$ DPUC array in Fig. 6(d) is shown in Fig. 6(e) using random inputs from a CW laser held at a constant total power of 10 µW before coupling to the chip. To extract the matrix elements of $\boldsymbol{C}$, clusters of pixels corresponding to each output grating coupler are summed to calculate the optical power at each output pair. Figure 6(f) plots the normalized output power of each grating coupler pair from the image in Fig. 6(e). The difference between the output powers of each grating pair provides the homodyne signal as before. In this way, the pixels of the near-IR image sensor enable temporal integration to occur across the entire DPUC array for a duration controlled by the exposure time of the sensor.
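
The read-out procedure described above (summing pixel clusters, then differencing each grating pair) can be sketched as follows. This is a minimal illustration, not the analysis code used for Fig. 6: the image is a plain nested list, the regions of interest (ROIs) are hypothetical rectangles that would in practice be located from a calibration image, and background subtraction and the per-cell scaling factor are omitted.

```python
def roi_sum(image, roi):
    """Sum the pixel values inside roi = (row0, row1, col0, col1), end-exclusive."""
    r0, r1, c0, c1 = roi
    return sum(sum(row[c0:c1]) for row in image[r0:r1])


def homodyne_matrix(image, roi_pairs):
    """roi_pairs[i][j] = (roi_plus, roi_minus) for the DPUC in row i, column j.
    Returns the differential (homodyne) signal for each unit cell."""
    return [[roi_sum(image, plus) - roi_sum(image, minus)
             for (plus, minus) in row_pairs]
            for row_pairs in roi_pairs]


# Tiny synthetic frame containing one unit cell with two 2x2 pixel clusters.
image = [[1, 2, 5, 1],
         [3, 4, 0, 2]]
pairs = [[((0, 2, 0, 2), (0, 2, 2, 4))]]   # hypothetical ROI coordinates
signal = homodyne_matrix(image, pairs)     # [[10 - 8]] = [[2]]
```

Because the sensor integrates over its exposure time, each summed ROI already contains the temporally accumulated power, so the subtraction is the only arithmetic needed per unit cell before calibration scaling.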

Extending our DPUC concept to two dimensions allows efficient GEMM operations with a wide range of practical applications. This platform would be particularly well suited for processing the linear (i.e., “fully connected”) layers in deep neural networks that typically have large dimensionality. The number of trained weights in such layers can easily exceed several million parameters per layer, which is far beyond the number of photonic memory cells that can fit on-chip, limiting the usefulness of weight-stationary approaches [20]. The ability to perform matrix-matrix multiplication in the time domain also allows one to batch many input vectors simultaneously and amortize the latency and energy cost of electro-optical modulation [21]. In addition to applications related to AI, two-dimensional arrays of our DPUC could also be used to perform real-time correlation between multiple digital or analog signals simultaneously. In this case, the input vectors ${\mathop {a}\limits^\rightharpoonup}_{i}$ and ${\mathop {b}\limits^\rightharpoonup}_{i}$ shown in Fig. 6(c) would share a common modulator (i.e., ${\mathop {a}\limits^\rightharpoonup}_{i}={\mathop {b}\limits^\rightharpoonup}_{j}$ for $i=j$), such that the output of the crossbar array would be the covariance of matrices $\boldsymbol{A}$ and $\boldsymbol{B}$. Such a platform could be used to analyze RF signals or traffic in a datacenter without requiring additional digital conversion and signal processing.
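
The time-domain GEMM described above can be emulated numerically to make the data flow explicit. The sketch below is an idealized, noiseless model with hypothetical names: row $i$ of `A` holds the time series applied to the $i$-th row modulator, column $j$ of `B` holds the time series applied to the $j$-th column modulator, and each unit cell $(i, j)$ integrates the product of its two inputs over all time steps.

```python
def crossbar_gemm(A, B):
    """Idealized time-multiplexed crossbar.
    A: n x T list of lists (row i is the time series a_i).
    B: T x m list of lists (column j is the time series b_j).
    Cell (i, j) accumulates a_i(t) * b_j(t) over the T time steps."""
    n, T, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for t in range(T):              # one time step = one symbol per modulator
        for i in range(n):
            for j in range(m):      # all cells operate in parallel on-chip
                C[i][j] += A[i][t] * B[t][j]
    return C


A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
C = crossbar_gemm(A, B)             # [[19.0, 22.0], [43.0, 50.0]]
```

In this model, only $n + m$ values are modulated per time step while $n \times m$ multiply-accumulate operations are performed, which is the amortization referred to above. Feeding the same signals to both sets of inputs (`B` equal to the transpose of `A`) yields $\boldsymbol{A}\boldsymbol{A}^{T}$, whose entries are the unnormalized pairwise correlations of the input signals.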

While the latency of our proof-of-concept DPUC array is limited by the maximum frame rate of the image sensor, future implementations could use on-chip high-speed photodetectors and transimpedance amplifiers to further improve read-out speed [41]. Integrating on-chip detectors would lead to a more compact and integrated system, essential for practical applications, and would additionally reduce latency through high-speed readout circuitry. Integration also has the potential to reduce overall power consumption by improving the coupling efficiency between the waveguide and the detectors. However, on-chip detector integration increases the complexity of the system design and decreases the integration density of the proposed architecture. Waveguide-integrated detectors also require advanced fabrication processes and custom electronic readout, which are beyond the scope of this work.

4. CONCLUSION

We have demonstrated the potential and scalability of a DPUC architecture for time-multiplexed, coherent optical computing applications on-chip. Our experiments showcased the capabilities of the DPUC for coherent multiply-accumulate operations in both the real and complex planes, as well as real-time correlation processing between stochastic bit streams. We have also demonstrated a strategy to combine multiple DPUCs within a linear and 2D array as well as perform parallel read-out of a full 2D array using an image sensor. This extends our approach beyond single vector-vector operations to full matrix-matrix multiplications on-chip. While our proof-of-concept platform is currently limited by the modulation speed of our thermal phase-shifters and read-out speed of our image sensor, future implementations leveraging high-speed electro-optic modulators at the input and fast on-chip balanced photodetectors at the output would address this (see [20] for more details). The combined benefits of temporal integration, quantum-limited homodyne detection, and on-chip integration through our crossbar array have the potential to enable efficient and large-scale GEMM operations for a diverse range of applications in AI.

APPENDIX A: METHODS

1. Measurement Setup

A tunable fiber laser (Santec TSL-550) was utilized to generate 1550 nm input light for the dot-product chip. An eight-channel fiber array was employed to couple light into and out of the chip via on-chip grating couplers. Modulation of the thermal phase shifters was achieved using an 8-channel, 16-bit current sourcing DAC (UEI DAQ DNA-AO-308-020). The output of the DAC was connected to a custom PCB, which provided connections to the photonic chip via wire bonds. For homodyne detection and monitoring the test ports, four free-space photodetectors (Newport 2011-FC) were connected to the output fibers with the photocurrent outputs connected to a Moku:Pro from Liquid Instruments to facilitate the capture and real-time recording of optical data from the chip.

2. Device Calibration

The dot-product unit cell design comprises two thermal Mach–Zehnder interferometers (MZIs) and two independent phase shifters in the arms. Prior to measurements, a calibration process was conducted. For each device, a DC current sweep was applied using the UEI DAQ. Two monitor ports and two differential output ports were simultaneously measured using four photodetectors. Subsequently, the analog outputs were digitized and recorded with a sampling frequency of 1 kHz. To minimize noise and enhance data reliability, 1000 samples were recorded and averaged for each sweep level. Finally, a numerical fitting was performed to model the transmission versus current relationship for both MZIs and the phase versus current relationship for both thermal phase shifters. The RMSE values calculated for MZI 1 and MZI 2 are 0.0021 and 0.0022, respectively. Through an analysis of the MZI test port errors, we determined the bit precision for setting the intensities of MZI 1 and MZI 2 to be 7.09 and 7.25, respectively. By performing the same analysis for the output errors of the DPUC, we calculated the bit precision for phase shifters 1 and 2 to be 3.65 and 3.7, respectively.

3. Power Analysis

Based on the experimental data depicted in Fig. 2(a), both Mach–Zehnder interferometers (MZIs) are configured to operate within the current range of 7.5–14.5 mA, ensuring optical modulation of the input amplitudes between the analog states 0 and 1. Considering an average current of 11 mA and a 240 Ω resistance of the thermo-optic phase shifters, each phase shifter requires approximately 29 mW of power during operation. With a 1 ms period for each input, a single MZI and accompanying phase shifter consumes approximately 58 µJ of energy per complex value encoded. This energy consumption is comparable to other optical modulators employing thermo-optic phase shifters. For instance, Refs. [43,44] report a power requirement of 12.7 mW and a tuning efficiency of 1.47 nm/mW for thermal MZIs and thermal microring resonators (MRRs), respectively. To enhance energy efficiency, integration of high-speed silicon modulators can be explored. Notably, Refs. [45,46] introduce a single P-N MRR and a segmented MZI with energy efficiencies of 4.5 pJ/bit and 17.4 fJ/bit, respectively. These modulators can also include electro-optic digital-to-analog conversion through segmentation, improving linearity and efficiency [47].
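
The arithmetic behind these estimates is reproduced in the short sketch below. It assumes purely resistive heaters ($P = I^{2}R$) and that both heaters associated with one complex input (the MZI heater and its accompanying phase shifter) draw the average current for the full 1 ms symbol period; the function names are illustrative.

```python
def heater_power_w(current_a, resistance_ohm):
    """Joule heating in a resistive thermo-optic phase shifter: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


def energy_per_symbol_j(power_w, period_s, n_heaters=2):
    """Energy per encoded complex value, assuming n_heaters identical heaters
    (MZI heater plus phase shifter) held on for one symbol period."""
    return n_heaters * power_w * period_s


p = heater_power_w(11e-3, 240.0)    # about 0.029 W (29 mW) per phase shifter
e = energy_per_symbol_j(p, 1e-3)    # about 5.8e-5 J (58 µJ) per complex value
```

Because the energy scales linearly with the symbol period, the same heaters driven at a shorter period would consume proportionally less energy per value, although thermal phase shifters are ultimately limited by their response time.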

Funding

National Science Foundation (2105972).

Acknowledgment

N.Y. acknowledges support from the University of Pittsburgh Momentum Fund.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

REFERENCES

1. R. Desislavov, F. Martínez-Plumed, and J. Hernández-Orallo, “Trends in AI inference energy consumption: beyond the performance-vs-parameter laws of deep learning,” Sustain. Comput. Inf. Syst. 38, 100857 (2023).

2. B. Kalis, M. Collier, and R. Fu, Article Technology: 10 Promising AI Applications in Health Care (2018).

3. OpenAI, Introducing ChatGPT (OpenAI, 2022).

4. D. Holz, “Midjourney,” 2023, https://www.midjourney.com/.

5. A. Ramesh, M. Pavlov, G. Goh, et al., Zero-Shot Text-to-Image Generation (2021).

6. N. C. Thompson, K. Greenewald, K. Lee, et al., The Computational Limits of Deep Learning (2020).

7. C. Li, X. Zhang, J. Li, et al., “The challenges of modern computing and new opportunities for optics,” PhotoniX 2, 20 (2021).

8. J. Végh, “Which scaling rule applies to large artificial neural networks,” Neural Comput. Appl. 33, 16847–16864 (2021).

9. A. N. Tait, M. A. Nahmias, B. J. Shastri, et al., “Broadcast and weight: an integrated network for scalable photonic spike processing,” J. Lightwave Technol. 32, 4029–4041 (2014).

10. M. J. Filipovich, Z. Guo, M. Al-Qadasi, et al., “Silicon photonic architecture for training deep neural networks with direct feedback alignment,” Optica 9, 1323–1332 (2022).

11. W. Zhou, B. Dong, N. Farmakidis, et al., “In-memory photonic dot-product engine with electrically programmable weight banks,” Nat. Commun. 14, 2887 (2023).

12. C. Wu, H. Yu, S. Lee, et al., “Programmable phase-change metasurfaces on waveguides for multimode photonic convolutional neural network,” Nat. Commun. 12, 96 (2021).

13. J. Feldmann, N. Youngblood, M. Karpov, et al., “Parallel convolutional processing using an integrated photonic tensor core,” Nature 589, 52–58 (2021).

14. Y. Shen, N. C. Harris, S. Skirlo, et al., “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11, 441–446 (2017).

15. X. Xiao, M. B. On, T. Van Vaerenbergh, et al., “Large-scale and energy-efficient tensorized optical neural networks on III–V-on-silicon MOSCAP platform,” APL Photonics 6, 126107 (2021).

16. S. K. Vadlamani, D. Englund, and R. Hamerly, “Transferable learning on analog hardware,” Sci. Adv. 9, eadh3436 (2023).

17. X. Xu, M. Tan, B. Corcoran, et al., “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature 589, 44–51 (2021).

18. G. Yang, C. Demirkiran, Z. E. Kizilates, et al., “Processing-in-memory using optically-addressed phase change memory,” in IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) (IEEE, 2023), pp. 1–6.

19. C. Demirkiran, F. Eris, G. Wang, et al., “An electro-photonic system for accelerating deep neural networks,” ACM J. Emerg. Technol. Comput. Syst. 19, 1–31 (2023).

20. N. Youngblood, “Coherent photonic crossbar arrays for large-scale matrix-matrix multiplication,” IEEE J. Sel. Top. Quantum Electron. 29, 6100211 (2023).

21. R. Hamerly, L. Bernstein, A. Sludds, et al., “Large-scale optical neural networks based on photoelectric multiplication,” Phys. Rev. X 9, 021032 (2019).

22. Z. Chen, A. Sludds, R. Davis, et al., “Deep learning with coherent VCSEL neural networks,” Nat. Photonics 17, 723–730 (2023).

23. A. Sludds, S. Bandyopadhyay, Z. Chen, et al., “Delocalized photonic deep learning on the internet’s edge,” Science 378, 270–276 (2022).

24. H. Zhang, M. Gu, X. D. Jiang, et al., “An optical neural chip for implementing complex-valued neural network,” Nat. Commun. 12, 457 (2021).

25. A. N. Tait, “Quantifying power use in silicon photonic neural networks,” Phys. Rev. Appl. 17, 054029 (2021).

26. N. G. Walker and J. E. Carroll, “Multiport homodyne detection near the quantum noise limit,” Opt. Quantum Electron. 18, 355–363 (1986).

27. T. Wang, S.-Y. Ma, L. G. Wright, et al., “An optical neural network using less than 1 photon per multiplication,” Nat. Commun. 13, 123 (2022).

28. X. Qiang, X. Zhou, J. Wang, et al., “Large-scale silicon quantum photonics implementing arbitrary two-qubit processing,” Nat. Photonics 12, 534–539 (2018).

29. A. Khachaturian, R. Fatemi, and A. Hajimiri, “IQ photonic receiver for coherent imaging with a scalable aperture,” arXiv, arXiv:2108.10225 (2021).

30. Y. Wang, X. Li, Z. Jiang, et al., “Ultrahigh-speed graphene-based optical coherent receiver,” Nat. Commun. 12, 5076 (2021).

31. W. Zhang, C. Huang, H.-T. Peng, et al., “Silicon microring synapses enable photonic deep learning beyond 9-bit precision,” Optica 9, 579–584 (2022).

32. K. Suzuki, G. Cong, K. Tanizawa, et al., “Ultra-high-extinction-ratio 2 × 2 silicon optical switch with variable splitter,” Opt. Express 23, 9086–9092 (2015).

33. W. R. Bennett, “Spectra of quantized signals,” Bell Syst. Tech. J. 27, 446–472 (1948).

34. A. Sebastian, T. Tuma, N. Papandreou, et al., “Temporal correlation detection using computational phase-change memory,” Nat. Commun. 8, 1115 (2017).

35. W. Jin, L. Fang, and L. Wang, “Abnormal detection and correlation analysis of communication network traffic based on behavior,” J. Phys. Conf. Ser. 1648, 032087 (2020).

36. P. Zhao, M. Kurihara, J. Tanaka, et al., “Advanced correlation-based anomaly detection method for predictive maintenance,” in IEEE International Conference on Prognostics and Health Management (ICPHM) (IEEE, 2017), pp. 78–83.

37. S. Zhong, H. Luo, L. Lin, et al., “An improved correlation-based anomaly detection approach for condition monitoring data of industrial equipment,” in IEEE International Conference on Prognostics and Health Management (ICPHM) (IEEE, 2016), pp. 1–5.

38. S. Ghazi Sarwat, F. Brückerhoff-Plückelmann, S. García-Cuevas Carrillo, et al., “An integrated photonics engine for unsupervised correlation detection,” Sci. Adv. 8, eabn3243 (2022).

39. C. Ríos, Q. Du, Y. Zhang, et al., “Ultra-compact nonvolatile phase shifter based on electrically reprogrammable transparent phase change materials,” PhotoniX 3, 26 (2022).

40. J. Sun, E. Timurdogan, A. Yaacobi, et al., “Large-scale nanophotonic phased array,” Nature 493, 195–199 (2013).

41. C. Rogers, A. Y. Piggott, D. J. Thomson, et al., “A universal 3D imaging sensor on a silicon photonics platform,” Nature 590, 256–261 (2021).

42. X. Zhang, K. Kwon, J. Henriksson, et al., “A large-scale microelectromechanical-systems-based silicon photonics LiDAR,” Nature 603, 253–258 (2022).

43. M. R. Watts, J. Sun, C. DeRose, et al., “Adiabatic thermo-optic Mach–Zehnder switch,” Opt. Lett. 38, 733–735 (2013).

44. A. H. Atabaki, A. A. Eftekhar, S. Yegnanarayanan, et al., “Sub-100-nanosecond thermal reconfiguration of silicon photonic devices,” Opt. Express 21, 15706–15718 (2013).

45. E. Timurdogan, C. M. Sorace-Agaskar, J. Sun, et al., “An ultralow power athermal silicon modulator,” Nat. Commun. 5, 4008 (2014).

46. X. Wu, B. Dama, P. Gothoskar, et al., “A 20 Gb/s NRZ/PAM-4 1V transmitter in 40 nm CMOS driving a Si-photonic modulator in 0.13 µm CMOS,” in IEEE International Solid-State Circuits Conference Digest of Technical Papers (IEEE, 2013), pp. 128–129.

47. S. Moazeni, S. Lin, M. Wade, et al., “A 40-Gb/s PAM-4 transmitter based on a ring-resonator optical DAC in 45-nm SOI CMOS,” IEEE J. Solid-State Circuits 52, 3503–3516 (2017).
