
Repairing the in situ hybridization missing data in the hippocampus region by using a 3D residual U-Net model

Open Access

Abstract

The hippocampus is a critical brain region. Transcriptome data provide valuable insights into the structure and function of the hippocampus at the gene level. However, transcriptome data are often incomplete. To address this issue, we use a convolutional neural network model to repair the missing voxels in the hippocampus region, based on the Allen Institute coronal-slice in situ hybridization (ISH) dataset. Moreover, we analyze the gene expression correlation between the coronal and sagittal datasets in the hippocampus region. The results demonstrate that the trend of gene expression correlation between the coronal and sagittal datasets remains consistent after the missing data in the coronal ISH dataset are repaired. Finally, we use the repaired ISH dataset to identify novel genes specific to hippocampal subregions. Our findings demonstrate the accuracy and effectiveness of using a deep learning method to repair missing ISH data. The repaired ISH dataset has the potential to improve our comprehension of the structure and function of the hippocampus.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

The hippocampus, a crucial brain region that connects numerous other brain regions, is associated with diverse higher cognitive functions [1,2] and is widely regarded as the anatomical basis of memory [3,4]. In addition, the hippocampus is one of the earliest brain regions affected by Alzheimer's disease [5,6]. Owing to the significance of the hippocampus and the progress in transcriptomics [7], comprehending the structure and function of the hippocampus at the genetic level has become one of the important research directions in neuroscience over the last two decades.

The rapid development of optical imaging technology has significantly promoted transcriptomics [8], especially spatial transcriptomics [9]. Spatial transcriptomics facilitates the quantification of gene expression within the spatial context of tissues and cells [10]. Numerous methods for spatial transcriptomics currently exist. Generally, these methods can be divided into two categories: imaging-based and sequencing-based [11]. The imaging-based methods include in situ hybridization (ISH) [12] and in situ sequencing (ISS) [13,14], while sequencing-based methods include array-based [15,16] and microdissection-based [17,18] approaches. Although array-based spatial transcriptome technology has developed rapidly over the past five years due to its high-throughput advantages [19], ISH has long been widely used for spatial visualization of gene expression [19], and owing to its high sensitivity, it is now commonly used as a validation method for other omics studies [20,21].

In particular, ISH has been widely used in biomedical research [22], especially in neuroscience [23]. The Allen mouse brain atlas (AMBA) created by ISH provides an important reference and tool for neuroscience research [24]. In the study of the hippocampus, this dataset has played an important role in gene expression analysis [25], brain region parcellation [7,26,27] and connectivity pathways [28]. However, ISH has limitations including low throughput, short read lengths [10], and slices lost during data preparation, leading to instances of missing data. Previous studies mostly used interpolation [29] or filtering [30,31] to process these missing ISH data, but they often fail to take full advantage of the potential information in the absent data. Over the past decade, deep learning has been widely used in image processing [32–34]. The application of convolutional neural networks to handle missing data in ISH datasets has proved to be an effective method [35]. Nevertheless, the lack of additional validation data raises questions regarding the reliability of deep learning performance.

Given these challenges, this study used a deep learning method to reconstruct the missing data in ISH coronal slices of the mouse hippocampus. The sagittal dataset provided by AMBA was then used as an independent dataset for validation. Finally, the repaired hippocampal ISH data were used to explore the presence of novel hippocampal subregion-specific genes.

2. Materials and methods

2.1 Mouse gene expression data

We used an adult mouse whole-brain ISH dataset [36] from AMBA [53]. The brains of mice were sectioned into coronal or sagittal slices and imaged. The resulting images were registered into a common coordinate space to produce a three-dimensional expression map, which was aligned with gene expression data through the Mouse Brain Common Coordinate Framework (CCFv3) [37]. Each image was divided into a grid with a resolution of 200 × 200 µm², and gene expression statistics were computed by counting the signals detected in each voxel. We downloaded the gene expression “energy” volume data [38] from ISH experiments using the Allen Institute's API. Because the mouse CCFv3 template provided by Allen has a high resolution (10 × 10 × 10 µm³), down-sampling needs to be performed first. The template is then used as a mask to extract the three-dimensional voxel coordinates and expression energy values of the whole brain and specific brain regions. The mouse brain coronal gene expression matrix is summarized based on the expression energy information of each gene (see supplementary materials for details). The dataset was created using a variety of related experimental techniques, including brain dissection, sectioning, in situ hybridization, and expression measurement (Fig. 1(a)) [39–41]. The 3D size of the dataset is 67 × 41 × 58 voxels, and the resolution of each voxel is 200 × 200 × 200 µm³. The coronal gene dataset comprised 4345 genes, while the sagittal gene dataset contained 21,717 genes.


Fig. 1. Research framework. (a) Schematic diagram depicting the process of preparing the coronal and sagittal ISH datasets for the mouse brain provided by AMBA [53]. The coronal and sagittal datasets were obtained by dissection, sectioning, staining and quantification of mouse brains. (b) Data preprocessing. After normalization, the numbers of complete genes in the hippocampal missing mode and the whole-brain set were counted respectively. (c) Model training strategy. The training dataset was prepared and fed into model training. (d) Model performance testing and result visualization. The predict and test datasets were used for internal verification with image evaluation metrics. The missing set was externally validated by correlation analysis with sagittal hippocampal data, matching gene names before and after prediction. Visualization was accomplished through multiple web pages and tools.


The coronal data were preferred for analysis due to their superior alignment with the reference model (CCFv3) compared to the sagittal data. We extracted the gene expression matrix with 3D information using the mouse brain template obtained from AMBA as a mask. We found missing data caused by experimental errors, including improper sectioning, artifacts during slice storage, and bubbles; these account for a significant portion of the missing data across the whole-brain slices. Simultaneously, we also faced the challenge of incomplete expression data for the 4345 genes in the coronal sections, complicating the validation of the reconstructed slice dataset. The research focused on a total of 20 coronal slices, including those containing the hippocampus. Missing patterns were identified according to whether all voxels in the dataset had an expression value of -1.

2.2 Normalization processing

The distribution of the data exhibits an exponential decay pattern as the expression value increases (see Fig. 3(a)). To reduce data instability caused by differences in expression scale, we applied a log2 transformation to the overall data.


Fig. 2. Schematic diagram of network model training. (a) Specific structure of the 3D Residual U-Net network, using an annotation template to store 3D data and employing the smooth L1 loss function to measure model robustness. (b) The missing dataset was created using the missing patterns of the fully expressed genes in the hippocampal coronal sections and the absent genes from AMBA. The neural network was trained to predict the missing patterns and fill in the corresponding voxel expression values of the complete genes. The network learned the mapping relationship between input and output, and the absent gene data from the hippocampus were used for prediction.



Fig. 3. Gene expression distribution of the AMBA coronal ISH dataset and the missing status of ISH slices. (a) The expression distribution of genes in the whole brain, the whole-brain set and the hippocampus in the coronal dataset; the heatmap shows the overall distribution of missing genes, and the bar graph shows the overall distribution of gene expression. (b) The proportion of missing genes in the hippocampus in the coronal dataset (Miss: missing genes; Complete: complete genes). (c) The proportion of missing individual slices according to the expression of the hippocampus in 16 of 58 coronal sections of the whole brain (Part deletion: partial deletion; Complete deletion: complete deletion).


Due to gene batch effects, the data span different scales and orders of magnitude. Normalization allows each gene to be compared on the same scale, ensuring that the results are more accurate and reliable [42]. The formula is as follows:

$$x_{new} = \frac{\log_2(x + 1) - \log_2(x_{\min} + 1)}{\log_2(x_{\max} + 1) - \log_2(x_{\min} + 1)}$$
xmax is the maximum pixel value of the image, and xmin is the minimum pixel value of the image.
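As a concrete check of the normalization formula above, here is a minimal pure-Python sketch (the function name and example values are our own, not from the original pipeline):

```python
import math

def log_minmax_normalize(values):
    """Log2 min-max normalization of expression energies to [0, 1],
    following the formula in the text."""
    logs = [math.log2(v + 1) for v in values]   # log2(x + 1)
    lo, hi = min(logs), max(logs)               # log2(x_min + 1), log2(x_max + 1)
    return [(l - lo) / (hi - lo) for l in logs]
```

For example, `log_minmax_normalize([0, 1, 3, 15])` maps the log2-transformed values 0, 1, 2, 4 onto [0, 0.25, 0.5, 1.0].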

2.3 Network model

We compared the 3D Residual U-Net model against two baselines: 3D U-Net and K-Nearest Neighbors (KNN). The traditional KNN algorithm handles missing voxels inadequately because it interpolates from adjacent voxel intensity values along a fixed direction. To address this, we used an iterative KNN algorithm: for every missing voxel, we count the number of adjacent expressed voxels, then interpolate and fill in order from most to fewest observed neighbors. Voxels completed in one round are treated as observed in the next round of interpolation. This approach effectively compensates for the limitations of the traditional algorithm, significantly enhancing the filling effect.
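The iterative filling described above can be sketched in plain Python. This is a simplified illustration (all names are ours, not the authors' implementation): each round fills the missing voxel with the most observed 6-neighbors, using the mean of those neighbors, and the filled value counts as observed in subsequent rounds.

```python
def iterative_knn_fill(grid, missing=-1):
    """Iteratively fill missing voxels from observed 6-neighbors,
    most-constrained voxels first. `grid` is a nested list [x][y][z]."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])

    def neighbors(x, y, z):
        for dx, dy, dz in ((1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)):
            i, j, k = x + dx, y + dy, z + dz
            if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
                yield grid[i][j][k]

    todo = {(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)
            if grid[x][y][z] == missing}
    while todo:
        # rank missing voxels by how many observed neighbors they have ("more to less")
        best = max(todo, key=lambda p: sum(v != missing for v in neighbors(*p)))
        obs = [v for v in neighbors(*best) if v != missing]
        if not obs:
            break  # isolated voxel: no information to interpolate from
        grid[best[0]][best[1]][best[2]] = sum(obs) / len(obs)  # filled value is reused next round
        todo.remove(best)
    return grid
```

On a toy 1 × 1 × 3 grid `[[[1, -1, 3]]]`, the single missing voxel is filled with the mean of its two observed neighbors, 2.0.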

3D U-Net is a 3D image convolutional neural network model based on the U-Net architecture [43], which is widely used in medical image segmentation tasks. Its fundamental concept involves decomposing the image into various resolution levels and using a convolutional neural network to gradually refine and accurately predict the results at each level. Building upon the 3D U-Net framework, the 3D Residual U-Net enhances the model by incorporating a residual module into the encoder. This connection mode enables the network to capture subtle differences between input and output, reducing decay during learning and making model training more stable [44]. In medical image segmentation tasks, where images contain various fine structures and lesions, the 3D Residual U-Net can better retain these details, thereby improving segmentation accuracy.

The 3D Residual U-Net network framework includes an encoder and a decoder (Fig. 2(a)). The encoder extracts high-level features from the original image, and the decoder maps these features back into the original image space to generate the final predicted feature map. In the encoder, the 3D Residual U-Net uses multi-stage down-sampling (max pooling/average pooling) to decompose the image into different resolution levels, while convolutions and nonlinear activation functions extract features. In the decoder, it uses multi-level up-sampling (transposed convolution) to map features back to the original image space, while skip connections help maintain the information flow between levels [45]. For the loss function, the smooth L1 function was chosen; because the processed data fall in the range 0-1, it behaves similarly to the MSE function there. Compared with other loss functions, its small gradient changes make it suitable for fitting samples with large differences, preventing over-fitting and gradient explosion. A mirrored extension is also used to avoid corner and boundary artifacts during up-sampling: the feature map of the decoder is first mirror-extended, and then the up-sampling operation is performed.
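The smooth L1 loss can be written out explicitly. The sketch below follows PyTorch's convention with threshold beta = 1 (an assumption on our part): quadratic for small residuals, linear for large ones, which keeps the gradient bounded.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss averaged over voxels.

    For |d| < beta the loss is 0.5 * d^2 / beta (MSE-like, matching the
    text's note that it is equivalent to MSE for data in [0, 1]);
    otherwise it is |d| - 0.5 * beta, giving a constant gradient."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)
```

For a residual of 0.5 the loss is 0.125 (quadratic branch); for a residual of 2.0 it is 1.5 (linear branch), instead of the 4.0 that squared error would give.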

The previous section introduced the preparation of the dataset. We divided the prepared dataset into training, validation and test sets in a 7:2:1 ratio. The model was trained on the training set, with the validation set used to optimize the model for improved performance. After training, the test set was used to evaluate model performance.

The shape of the raw image is 41 × 58 × 67, and we took patches of 41 × 58 × 20 as input images. Because the whole-brain ISH data contain a large amount of missing data, selecting the whole-brain set provides a subset of fully expressed gene data that can be used to train the model. The images were normalized by subtracting the mean and dividing by the standard deviation. Given the large amount of data, we did not use data augmentation techniques.

In our study, the smooth L1 loss was applied to train the model; its advantage is a stable gradient that gradually converges to the optimal solution during training. The number of epochs was set to 250 and the batch size to 1. The experiment was validated once every 2000 iterations. We used a learning rate scheduler that automatically adjusts the learning rate based on changes in the monitored metrics. When the performance metrics on the validation set stop improving, the scheduler gradually decreases the learning rate to allow the model to converge better.

We initialized the learning rate at 8 × 10−6. When the loss does not decrease, the learning rate is halved. Early stopping is a training strategy that monitors the performance of the model after each epoch during training. The patience of early stopping was set to ten epochs, so training is terminated if the validation loss does not decrease for ten consecutive epochs.
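The learning-rate halving and early-stopping policy can be illustrated by replaying a hypothetical sequence of validation losses. This is a per-epoch simplification of our own (the helper name and example losses are not from the paper, which also validates every 2000 iterations):

```python
def train_with_early_stopping(val_losses, lr=8e-6, patience=10):
    """Replay validation losses through the policy in the text:
    halve the learning rate whenever the loss fails to improve,
    stop after `patience` consecutive epochs without improvement.
    Returns (epochs actually run, final learning rate)."""
    best = float("inf")
    stalled = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stalled = loss, 0
        else:
            stalled += 1
            lr *= 0.5  # "the learning rate is halved"
            if stalled >= patience:
                return epoch + 1, lr  # early stop
    return len(val_losses), lr
```

With losses `[1.0, 1.1, 1.2]` and `patience=2`, training stops after the third epoch with the learning rate halved twice; a later improvement resets the patience counter.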

We used PyTorch to train the model on a 24-core workstation with an NVIDIA RTX A5000 GPU. The model hyperparameters are detailed in Supplementary Table S1.

2.4 Training set preparation

We divided the missing gene dataset according to whether a deletion was present in the hippocampus, and then divided the complete gene dataset according to whether a deletion was present in the hippocampal coronal slice dataset. The deletions in the hippocampal coronal slices corresponding to the missing genes were counted, and the extracted missing patterns were combined with the complete gene dataset by random sampling to generate missing data. Since the dataset is three-dimensional and traditional file formats only suit two-dimensional data, we adopted the H5 (HDF5) format to save the complete data of each gene, the generated missing data and the missing pattern into a single file for subsequent network training. The initial dataset was divided into training, validation and test sets in a 7:2:1 ratio. The training and validation sets were used for model training, and the test set was used to evaluate model performance.
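The 7:2:1 split can be sketched as follows (the seed, helper name, and use of integer gene IDs are our own assumptions for illustration):

```python
import random

def split_dataset(gene_ids, seed=0):
    """Shuffle gene samples and split them 7:2:1 into
    train / validation / test subsets, as described in the text."""
    ids = list(gene_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test
```

For 10 samples this yields subsets of sizes 7, 2 and 1, covering every sample exactly once.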

2.5 Metric of assessment

We use the missing image data generated in the training set to restore the original fully expressed image data through model training. After the model fitting is completed, the missing image data in the test set is used to generate predicted image data through the fitted model.

To compare the repair performance of different models, we analyze the similarity between the predicted image data and the original fully expressed image data from two perspectives: the entire hippocampus and the coronal slices where the hippocampus is located. Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM) are used as image evaluation metrics.

Because the overall expression distribution of the data is skewed (see Fig. 3(a)), we used Spearman correlations to evaluate the similarity changes between coronal and sagittal hippocampal data for the same gene before and after repair. All data were processed using libraries such as Scikit-image, Scikit-learn and Scipy in Python, as well as R tools (see the data analysis section in Supplementary Methods for details).

PSNR: It is used to measure the difference between the distorted image and the reference image and is one of the most commonly used image quality evaluation metrics. A higher PSNR value indicates less distortion and better image quality.

$$PSNR = 10{\log _{10}}(\frac{{MAX_I^2}}{{MSE}})$$

MAXI is the maximum pixel value of the image, and MSE is the Mean Squared Error.

MSE: It is used to measure the difference between the distorted image and the reference image. A smaller MSE value indicates less distortion and better image quality.

$$MSE = \frac{1}{MN}\sum\nolimits_{i = 1}^M \sum\nolimits_{j = 1}^N (I_{ij} - K_{ij})^2$$

Iij is the pixel value of the reference image, Kij is the pixel value of the distorted image, M and N are the width and height of the image.

SSIM: It is used to evaluate the similarity between the distorted image and the reference image, comparing brightness, contrast, and structure.

$$SSIM(I,K) = \frac{(2\mu_I \mu_K + C_1)(2\sigma_{IK} + C_2)}{(\mu_I^2 + \mu_K^2 + C_1)(\sigma_I^2 + \sigma_K^2 + C_2)}$$
μI and μK denote the means of the reference and distorted images, σI2 and σK2 denote their variances, σIK denotes their covariance, and C1 and C2 are constants that prevent the denominator from being zero.
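The three image-quality metrics defined above reduce to a few lines of Python. The sketch below operates on flattened pixel sequences; the single-window SSIM follows Eq. (4) directly and is a simplification of windowed implementations such as scikit-image's (the constants C1 and C2 here are illustrative choices):

```python
import math

def mse(I, K):
    """Mean squared error between two equally sized pixel sequences."""
    return sum((i - k) ** 2 for i, k in zip(I, K)) / len(I)

def psnr(I, K, max_i=1.0):
    """Peak signal-to-noise ratio in dB; data in this paper are
    normalized to [0, 1], so MAX_I defaults to 1 (an assumption)."""
    return 10 * math.log10(max_i ** 2 / mse(I, K))

def ssim_global(I, K, c1=1e-4, c2=9e-4):
    """SSIM computed over the whole image as a single window."""
    n = len(I)
    mu_i, mu_k = sum(I) / n, sum(K) / n
    var_i = sum((x - mu_i) ** 2 for x in I) / n
    var_k = sum((x - mu_k) ** 2 for x in K) / n
    cov = sum((x - mu_i) * (y - mu_k) for x, y in zip(I, K)) / n
    return ((2 * mu_i * mu_k + c1) * (2 * cov + c2)) / \
           ((mu_i ** 2 + mu_k ** 2 + c1) * (var_i + var_k + c2))
```

For identical images SSIM evaluates to 1; an MSE of 0.25 on unit-range data gives a PSNR of 10·log10(4) ≈ 6.02 dB.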

Spearman correlation coefficient (rs): It is used to evaluate the correlation between two images, with a value range of [-1, 1]. An absolute value closer to 1 indicates a stronger monotonic correlation between the two images.

$$r_S(I,K) = \frac{\frac{1}{N}\sum\nolimits_{i = 1}^N (I_i - \bar{I})(K_i - \bar{K})}{\sqrt{\left(\frac{1}{N}\sum\nolimits_{i = 1}^N (I_i - \bar{I})^2\right)\left(\frac{1}{N}\sum\nolimits_{i = 1}^N (K_i - \bar{K})^2\right)}}$$

I and K represent the sequences of pixel values of the reference and distorted images, respectively, N represents the number of pixels, and $\bar{{\boldsymbol I}}$ and $\bar{{\boldsymbol K}}$ represent the means of I and K. In practice, the Spearman coefficient applies this formula to the rank-transformed pixel values. Its calculation depends on the pixel values of both images, so correlation can only be computed between two images with corresponding pixel values.
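A rank-based sketch of the Spearman coefficient follows (our own helper names; no tie handling, unlike SciPy's `spearmanr`, which averages tied ranks):

```python
import math

def spearman(I, K):
    """Spearman correlation: the correlation formula above applied to
    the rank-transformed sequences (assumes no tied values)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank + 1)
        return r

    ri, rk = ranks(I), ranks(K)
    n = len(I)
    mi, mk = sum(ri) / n, sum(rk) / n
    cov = sum((a - mi) * (b - mk) for a, b in zip(ri, rk)) / n
    si = math.sqrt(sum((a - mi) ** 2 for a in ri) / n)
    sk = math.sqrt(sum((b - mk) ** 2 for b in rk) / n)
    return cov / (si * sk)
```

Any strictly increasing relationship gives rs = 1 and a strictly decreasing one gives rs = -1, regardless of the actual magnitudes.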

2.6 Validation of sagittal data

After training the model to complete the hippocampal missing gene set, the coronal hippocampal gene set, before and after prediction, was compared with the sagittal data of the same genes to verify the performance of the network model. Because each gene in the coronal set may correspond to one or more experiments in the sagittal dataset, the coronal and sagittal data were matched by gene name and filtered before analysis.

2.7 Visualization of specific genes

We observed coronal gene expression in the hippocampus before and after completion while validating against the sagittal data. Considering that the sagittal data cover only the left hemisphere, and that we observed the same expression trend between the left and right hippocampus when analyzing the coronal hippocampal data, the left hippocampus was used for analysis. The three-dimensional location and expression information of the hippocampus were integrated into three-dimensional scatter visualizations, and the extracted hippocampal subregions were screened for specific genes by manual alignment. All analyses were performed using Origin 2018.

3. Results

3.1 Analysis of missing sections

There are a large number of missing slices in the Allen mouse brain ISH dataset. To address this problem, we need to repair the missing voxels in the ISH slices. The missing-data patterns for the whole mouse brain, the hippocampal coronal slices, and the hippocampal region are shown in Fig. 3(a). We found that as the expression energy increases, the expression distribution decreases exponentially in the whole-brain and whole-brain-set data. Median statistics showed that the median [lower, upper quartile] expression of the coronal genes was 1.15 [0.14, 5.43] in the hippocampus, 0.83 [0.087, 4.71] in the whole-brain set including the hippocampus, and 0.70 [0.065, 4.11] in the whole brain. The medians all fall in the range 0.7-1.2 and the upper quartiles in the range 4.1-5.5, indicating similar expression distribution trends, with low-expression voxels accounting for a high proportion of the overall data. The missing coronal genes under different datasets are shown in Fig. 3(b). We computed the proportion of genes containing missing voxels. Among the 4345 coronal genes in the whole brain region, a total of 581 genes were fully expressed in the hippocampal coronal slices, while the other 3764 genes had missing data, accounting for 86.6% of the total. Within the hippocampus region itself, 2405 genes were fully expressed and 1940 genes had missing data, accounting for 44.6% of the total.

As the voxel count of a brain region decreases, the number of genes with complete expression in the coronal sections increases, indicating that most of the missing data in coronal brain slices are partial rather than complete. Among the 1940 missing genes, each gene has a distinct pattern of missing data. Statistical analysis of the missing gene expression in each coronal slice of the hippocampus indicates that 50% of the slices exhibit partial missing data. The proportion of missing slices is 16.2%, corresponding to an average loss of 2.6 hippocampal slices per gene (see Fig. 3(c)).

3.2 Test the effect of missing repair

We compared the model performance and predictive effect of image repair by 3D Residual U-Net, 3D U-Net and iterative KNN on the same dataset. The 3D Residual U-Net network exhibited clear superiority over the other two models, particularly the iterative KNN model (see Fig. 4(a)). Considering only the hippocampal region, the average MSE, PSNR and SSIM for the iterative KNN model were 1.162, 34.35 and 0.927, respectively. The 3D U-Net model improved on the iterative KNN with 0.848, 36.49 and 0.947 on the same metrics. The best-performing model was the 3D Residual U-Net, at 0.639, 36.92 and 0.952.


Fig. 4. Model performance validation. (a) 3D Residual U-Net, 3D U-Net, and iterative KNN were applied to the hippocampal coronal slice data and the hippocampal dataset test set. Image evaluation metrics were used to analyze model performance. (b) The visualization effect and metrics of the three models (taking gene Mef2c as an example). (MSE: Mean Squared Error; PSNR: Peak Signal-to-Noise Ratio; SSIM: Structural Similarity Index)


As an example of the visualized results, Fig. 4(b) shows the repaired 3D images of Mef2c gene ISH slices under the three missing modes. Compared with the non-missing data in the left column, the iterative KNN results exhibited discontinuity in repairing the missing hippocampal voxels. While the 3D U-Net method improved the continuity of the repaired results, it was unstable in predicting high-expression voxels. In contrast, the 3D Residual U-Net exhibited superior completion performance.

3.3 Coronal versus sagittal validation

To verify the reliability of the 3D Residual U-Net method for ISH coronal data repair, we further validated its performance using the sagittal dataset. The distributions of correlation values for expressed voxels in the coronal and sagittal datasets before and after completion are similar (Fig. 5(a)), with no statistically significant changes (Fig. 5(b)): the mean was 0.1782 and the median 0.1635 pre-prediction, compared with a mean of 0.1757 and a median of 0.1618 post-prediction. In the sensitivity analysis, the Pearson correlation results were similar to the Spearman results (see Fig. S1) and the R-squared reached 0.98 (see Fig. S2), indicating that the correlation distribution is highly consistent.


Fig. 5. Sagittal dataset validation. (a) The correlation distribution of coronal and sagittal data matched by gene name before and after prediction by the 3D Residual U-Net model; the Spearman correlation coefficient was used for statistical analysis (Miss: correlation between expression and coronal missing voxels in sagittal data; Raw: before model prediction; Predict: after model prediction). (b) Statistical significance test of coronal-sagittal correlations before and after prediction. (c) Example visualization of the left hippocampus before and after coronal data prediction, together with sagittal data for the same gene.


At the same time, the genes Anxa5, Acyp2, Parvb and Trhde were selected as examples to visualize gene expression levels in the hippocampus before and after prediction. The reliability of the method was qualitatively assessed by validation against sagittal data of the same genes.

3.4 Hippocampal subregion specific gene explore

To further explore the potential applications of our repaired hippocampal ISH dataset, we performed an initial manual visual exploration of hippocampal subregion-specific genes. Compared with previous studies [7,26,27], we identified novel hippocampal subregion-specific genes and display their specific expression in various hippocampal subregions in Fig. 6, such as Nr2f1 (CA3v), Trhde (CA3d), and Dpf3 (DGd). These genes, after repair (middle column in Fig. 6), can be verified against the sagittal slice dataset (right column in Fig. 6). The completion of these missing regions by our network allowed us to discover potential expression information that was originally lost due to experimental preparation. This provides additional data support for research on the structure and function of the hippocampus.


Fig. 6. Screening of hippocampal subregion-specific genes. Manual screening of specifically expressed genes after coronal and sagittal visualization of the left hippocampus. The left column is raw coronal ISH data, the middle column is repaired ISH data, and the right column is sagittal ISH data.


4. Discussion

In the current study, we used a deep learning method to repair the missing data of mouse hippocampal ISH coronal slices. First, we paired the incompletely expressed data of each gene in the hippocampal coronal dataset with the complete data to establish a training set for the model (see Fig. 4). Secondly, because the substantial sagittal dataset covers most coronal genes, we proposed to validate the coronal dataset against the sagittal dataset for the same genes. As illustrated in the results (see Fig. 5), the overall trend of coronal-sagittal correlation remained consistent pre- and post-completion, indicating that the predicted missing data aligned with the 3D structural information of the original dataset. The validation against the sagittal dataset strengthened the reliability of the model predictions and reflected the robustness of the method. Finally, we investigated the potential utility of the completed hippocampal data. Upon visualizing the gene expression results in different subregions of the coronal and sagittal hippocampus (see Fig. 6), we identified novel hippocampal subregion-specific expressed genes. Furthermore, this forms a novel base dataset for the subsequent development of a comprehensive mouse brain ISH dataset.

To the best of our knowledge, we are the first to validate the coronal dataset against the sagittal dataset to demonstrate the recovery of missing slice data. With this additional set of ISH data, we verified that the hippocampal expression pattern was consistent between the coronal and sagittal datasets for the same gene, affirming the reliability of deep learning methods in predicting missing slice data. We chose the sagittal dataset because, on the one hand, it follows the same experimental workflow as the coronal dataset and is thus more convincing for validation, especially relative to other omics data. On the other hand, transcriptomes describe biological functional states in a transient manner, necessitating verification of the same biological phenomena across multiple instances [46], which highlights the importance of our work. Although the latest spatial transcriptome technologies offer high throughput and single-cell resolution, enabling whole-transcriptome analysis of a single mouse brain [47,48], all genes are measured from the same organism at the same moment. They therefore cannot achieve single-gene ISH validation equivalent to the AMBA ISH data in terms of biological replication. In addition, low-expression genes are filtered out owing to instrument limitations, which makes accurate quantification difficult [16]. By analyzing the correlation between the coronal and sagittal datasets, our study further verified the effect of data completion and provided more complete data support for follow-up studies. This verification approach also presents novel research directions for future investigators.

The results indicate that the 3D Residual U-Net performs better than the other models on all metrics (see Fig. 5(a)). There is a gap between machine learning and deep learning models in data-processing performance. As one of the most commonly used interpolation methods in machine learning, KNN interpolation is often applied to restoration tasks for small-scale images [29]. However, it lacks the ability to obtain global information [35], and its handling of complex textures and details in missing areas is highly subjective. In contrast, the deep-learning-based 3D U-Net can effectively capture image context and learn autonomously, so it generally achieves more natural and more accurate restoration [49]. Nevertheless, previous studies have shown that the 3D U-Net network takes a long time to train and its results are not stable enough [50], so we improved the 3D U-Net structure. The 3D Residual U-Net increases feature depth and complexity through residual connections, which better extract image information [44]. It has better generalization and robustness, and effectively avoids gradient decay.

Finally, we used the repaired hippocampal coronal dataset to identify genes specific to different hippocampal subregions. Owing to their restricted expression, specific genes have long been used to define the boundaries of brain regions, especially subregions and subcortical structures [51,52]. Previous studies have explored the demarcation of hippocampal subregion boundaries [4], finding that CA3 comprises nine domains, each responsible for different functions [26], while CA1 and DG each comprise three continuous domains [7,27]. In addition, various combinations of specific genes have been used to delineate hippocampal subregion boundaries, so as to draw two-dimensional coronal and sagittal maps of the hippocampus and determine the range of functional connectivity with other brain regions [28]. We validated our coronal-sagittal correlation approach on subregion-specific genes previously identified by manual screening (e.g., Wfs1, Zbtb20, Dcn, Htr2c, Grp) and found that most of them show consistent expression in both the coronal and sagittal datasets (see Fig. 6(a)). After screening the completed hippocampal coronal gene set, we identified novel hippocampal subregion-specific genes, as well as genes that show specific expression in the coronal dataset but inconsistent expression in the sagittal dataset (see Fig. 6(b)). By completing the missing data in the coronal sections of the hippocampus, we found more genes that are specifically expressed in the hippocampus, which may have important implications for understanding the structure and function of the hippocampal subregions.
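The screening idea can be sketched as a simple two-part criterion. This is a hypothetical illustration only: the helper names and thresholds (0.5 correlation, 2.0-fold enrichment) are assumptions, not the paper's actual screening values.

```python
# Hypothetical sketch of subregion-specific gene screening: keep a gene
# when (1) its repaired coronal expression agrees with the sagittal data
# (high coronal-sagittal correlation) and (2) it is enriched in one
# subregion relative to the rest of the hippocampus.
# Thresholds below are illustrative, not the paper's values.

def enrichment(subregion_mean: float, other_mean: float,
               eps: float = 1e-9) -> float:
    # Fold-enrichment of a subregion over the rest of the hippocampus;
    # eps guards against division by zero for unexpressed genes.
    return subregion_mean / (other_mean + eps)

def is_subregion_specific(corr_coronal_sagittal: float,
                          sub_mean: float, other_mean: float,
                          min_corr: float = 0.5,
                          min_fold: float = 2.0) -> bool:
    return (corr_coronal_sagittal >= min_corr
            and enrichment(sub_mean, other_mean) >= min_fold)
```

For example, a gene with coronal-sagittal correlation 0.8 and four-fold CA3 enrichment would pass, while one whose coronal and sagittal patterns disagree (correlation 0.2) would be flagged as inconsistent rather than subregion-specific.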

5. Conclusion

The vast data resources provided by the Allen Institute for Brain Science have long served as a platform for researchers' data analysis. Constructing complete datasets provides more research ideas for subsequent studies and enables validation of results obtained with other techniques. In the future, the repaired dataset will have important research and reference value for comparative studies of mouse brains and across species.

Funding

Ministry of Science and Technology of the People's Republic of China (2022YEF0203200, 2022ZD0212200); National Natural Science Foundation of China (61890950, 61890951, 62275095, 82260227); Natural Science Foundation of Hainan Province (821QN226); Hainan University (KYQD(ZR)20072, KYQD(ZR)22074).

Disclosures

The authors declare no conflicts of interest.

Data availability

We used data from the Allen Brain Institute [53]. The experimental platform provided a comprehensive digital atlas of 20,000 gene expression patterns in the adult mouse brain [36]. AMBA ISH data sets can be downloaded through API [54]. Custom code supporting the current study is available at [55].

Supplemental document

See Supplement 1 for supporting content.

References

1. S. Genon, B. C. Bernhardt, R. La Joie, et al., “The many dimensions of human hippocampal organization and (dys)function,” Trends Neurosci. 44(12), 977–989 (2021). [CrossRef]  

2. H. J. Shi, S. Wang, X. P. Wang, et al., “Hippocampus: molecular, cellular, and circuit features in anxiety,” Neurosci. Bull. 39(6), 1009–1026 (2023). [CrossRef]  

3. N. M. van Strien, N. L. Cappaert, and M. P. Witter, “The anatomy of memory: an interactive overview of the parahippocampal-hippocampal network,” Nat. Rev. Neurosci. 10(4), 272–282 (2009). [CrossRef]  

4. B. A. Strange, M. P. Witter, E. S. Lein, et al., “Functional organization of the hippocampal longitudinal axis,” Nat. Rev. Neurosci. 15(10), 655–669 (2014). [CrossRef]  

5. N. S. Pentkowski, K. K. Rogge-Obando, T. N. Donaldson, et al., “Anxiety and Alzheimer's disease: Behavioral analysis and neural basis in rodent models of Alzheimer's-related neuropathology,” Neurosci. Biobehav. Rev. 127, 647–658 (2021). [CrossRef]  

6. S. Maleki Balajoo, S. B. Eickhoff, S. K. Masouleh, et al., “Hippocampal metabolic subregions and networks: Behavioral, molecular, and pathological aging profiles,” Alzheimer’s Dementia 19(11), 4787–4804 (2023). [CrossRef]  

7. M. S. Fanselow and H. W. Dong, “Are the dorsal and ventral hippocampus functionally distinct structures?” Neuron 65(1), 7–19 (2010). [CrossRef]  

8. H. Shi, Y. Guan, J. Chen, et al., “Optical Imaging in Brainsmatics,” Photonics 6(3), 98 (2019). [CrossRef]  

9. S. M. Lewis, M. L. Asselin-Labat, Q. Nguyen, et al., “Spatial omics and multiplexed imaging to explore cancer biology,” Nat. Methods 18(9), 997–1012 (2021). [CrossRef]  

10. L. Tian, F. Chen, and E. Z. Macosko, “The expanding vistas of spatial transcriptomics,” Nat. Biotechnol. 41(6), 773–782 (2023). [CrossRef]  

11. C. G. Williams, H. J. Lee, T. Asatsuma, et al., “An introduction to spatial transcriptomics for biomedical research,” Genome Med. 14(1), 68 (2022). [CrossRef]  

12. F. Wang, J. Flanagan, N. Su, et al., “RNAscope: a novel in situ RNA analysis platform for formalin-fixed, paraffin-embedded tissues,” J. Mol. Diagn. 14(1), 22–29 (2012). [CrossRef]  

13. E. Lubeck, A. F. Coskun, T. Zhiyentayev, et al., “Single-cell in situ RNA profiling by sequential hybridization,” Nat. Methods 11(4), 360–361 (2014). [CrossRef]  

14. K. H. Chen, A. N. Boettiger, J. R. Moffitt, et al., “RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells,” Science 348(6233), aaa6090 (2015). [CrossRef]  

15. S. G. Rodriques, R. R. Stickels, A. Goeva, et al., “Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution,” Science 363(6434), 1463–1467 (2019). [CrossRef]  

16. A. Chen, S. Liao, M. Cheng, et al., “Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays,” Cell 185(10), 1777–1792.e21 (2022). [CrossRef]  

17. M. R. Emmert-Buck, R. F. Bonner, P. D. Smith, et al., “Laser capture microdissection,” Science 274(5289), 998–1001 (1996). [CrossRef]  

18. J. Chen, S. Suo, P. P. Tam, et al., “Spatial transcriptomic analysis of cryosectioned tissue samples with Geo-seq,” Nat Protoc 12(3), 566–580 (2017). [CrossRef]  

19. L. Moses and L. Pachter, “Museum of spatial transcriptomics,” Nat. Methods 19(5), 534–546 (2022). [CrossRef]  

20. M. S. Cembrowski, J. L. Bachman, L. Wang, et al., “Spatial Gene-Expression Gradients Underlie Prominent Heterogeneity of CA1 Pyramidal Neurons,” Neuron 89(2), 351–368 (2016). [CrossRef]  

21. Z. Yao, C. T. J. van Velthoven, T. N. Nguyen, et al., “A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation,” Cell 184(12), 3222–3241.e26 (2021). [CrossRef]  

22. L. Jiang, X. Xie, N. Su, et al., “Large Stokes shift fluorescent RNAs for dual-emission fluorescence and bioluminescence imaging in live cells,” Nat. Methods 20(10), 1563–1572 (2023). [CrossRef]  

23. I. Shainer, E. Kuehn, E. Laurell, et al., “A single-cell resolution gene expression atlas of the larval zebrafish brain,” Sci. Adv. 9(8), eade9909 (2023). [CrossRef]  

24. L. Ng, A. Bernard, C. Lau, et al., “An anatomic gene expression atlas of the adult mouse brain,” Nat Neurosci 12(3), 356–362 (2009). [CrossRef]  

25. M. S. Cembrowski, L. Wang, K. Sugino, et al., “Hipposeq: a comprehensive RNA-seq database of gene expression in hippocampal principal neurons,” Elife 5, e14997 (2016). [CrossRef]  

26. C. L. Thompson, S. D. Pathak, A. Jeromin, et al., “Genomic anatomy of the hippocampus,” Neuron 60(6), 1010–1021 (2008). [CrossRef]  

27. H. W. Dong, L. W. Swanson, L. Chen, et al., “Genomic-anatomic evidence for distinct functional domains in hippocampal field CA1,” Proc. Natl. Acad. Sci. U.S.A. 106(28), 11794–11799 (2009). [CrossRef]  

28. M. S. Bienkowski, I. Bowman, M. Y. Song, et al., “Integration of gene expression and brain-wide connectivity reveals the multiscale organization of mouse hippocampal networks,” Nat. Neurosci. 21(11), 1628–1643 (2018). [CrossRef]  

29. A. Beauchamp, Y. Yee, B. C. Darwin, et al., “Whole-brain comparison of rodent and human brains using spatial transcriptomics,” Elife 11, e79418 (2022). [CrossRef]  

30. Y. Li, H. Chen, X. Jiang, et al., “Transcriptome Architecture of Adult Mouse Brain Revealed by Sparse Coding of Genome-Wide In Situ Hybridization Images,” Neuroinformatics 15(3), 285–295 (2017). [CrossRef]  

31. Y. Li, H. Chen, X. Jiang, et al., “Discover mouse gene coexpression landscapes using dictionary learning and sparse coding,” Brain Struct Funct 222(9), 4253–4270 (2017). [CrossRef]  

32. J. Sun, L. Yuan, J. Jia, et al., “Image completion with structure propagation,” in ACM SIGGRAPH 2005 Papers (2005), pp. 861–868.

33. J. Hays and A. A. Efros, “Scene completion using millions of photographs,” ACM Trans. Graph. 26(3), 4 (2007). [CrossRef]  

34. J. Xie, L. Xu, and E. Chen, “Image denoising and inpainting with deep neural networks,” in Advances in Neural Information Processing Systems 25 (2012).

35. Y. Li, H. Huang, H. Chen, et al., “Deep Neural Networks for In Situ Hybridization Grid Completion and Clustering,” IEEE/ACM Trans Comput Biol Bioinform 17, 1 (2019). [CrossRef]  

36. E. S. Lein, M. J. Hawrylycz, N. Ao, et al., “Genome-wide atlas of gene expression in the adult mouse brain,” Nature 445(7124), 168–176 (2007). [CrossRef]  

37. Q. Wang, S. L. Ding, Y. Li, et al., “The Allen Mouse Brain Common Coordinate Framework: A 3D Reference Atlas,” Cell 181(4), 936–953.e20 (2020). [CrossRef]  

38. C. K. Lee, S. M. Sunkin, C. Kuan, et al., “Quantitative methods for genome-scale analysis of in situ hybridization and correlation with microarray data,” Genome Biol 9(1), R23 (2008). [CrossRef]  

39. P. A. Yushkevich, B. B. Avants, L. Ng, et al., “3D mouse brain reconstruction from histology using a coarse-to-fine approach,” in Biomedical Image Registration: Third International Workshop, WBIR 2006 (Springer, 2006), pp. 230–237.

40. L. Ng, S. Pathak, C. Kuan, et al., “Neuroinformatics for genome-wide 3D gene expression mapping in the mouse brain,” IEEE/ACM Trans. Comput. Biol. and Bioinf. 4(3), 382–393 (2007). [CrossRef]  

41. L. L. Ng, S. M. Sunkin, D. Feng, et al., “Large-scale neuroinformatics for in situ hybridization data in the mouse brain,” Int Rev Neurobiol 104, 159–182 (2012). [CrossRef]  

42. A. Arnatkeviciute, B. D. Fulcher, and A. Fornito, “A practical guide to linking brain-wide gene expression and neuroimaging data,” NeuroImage 189, 353–367 (2019). [CrossRef]  

43. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (2015), pp. 234–241.

44. K. He, X. Zhang, S. Ren, et al., “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), pp. 770–778.

45. D. Pathak, P. Krahenbuhl, J. Donahue, et al., “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 2536–2544.

46. M. Blumenberg, “Introductory chapter: transcriptome analysis,” Transcriptome analysis 370, 1–5 (2019). [CrossRef]  

47. Y. Wang, B. Liu, G. Zhao, et al., “Spatial transcriptomics: Technologies, applications and experimental considerations,” Genomics 115(5), 110671 (2023). [CrossRef]  

48. A. Robles-Remacho, R. M. Sanchez-Martin, and J. J. Diaz-Mochon, “Spatial Transcriptomics: Emerging Technologies in Tissue Gene Expression Profiling,” Anal. Chem. 95(42), 15450–15460 (2023). [CrossRef]  

49. A. Andonian, D. Paseltiner, T. J. Gould, et al., “A deep learning based method for large-scale classification, registration, and clustering of in-situ hybridization experiments in the mouse olfactory bulb,” J. Neurosci. Methods 312, 162–168 (2019). [CrossRef]  

50. R. Raza, U. Ijaz Bajwa, Y. Mehmood, et al., “dResU-Net: 3D deep residual U-Net based brain tumor segmentation from multimodal MRI,” Biomedical Signal Processing and Control 79, 103861 (2023). [CrossRef]  

51. T. Yamamori and K. S. Rockland, “Neocortical areas, layers, connections, and gene expression,” Neurosci. Res. 55(1), 11–27 (2006). [CrossRef]  

52. D. D. O’Leary, S.-J. Chou, and S. Sahara, “Area patterning of the mammalian cortex,” Neuron 56(2), 252–269 (2007). [CrossRef]  

53. Allen Institute for Brain Science, “Allen Brain Map,” Allen Institute, 2024, https://portal.brain-map.org/.

54. Allen Brain Map Community Forum, “Downloading 3-D Expression Grid Data,” Allen Institute, 2024, https://community.brain-map.org/t/downloading-3-d-expression-grid-data/2859.

55. ZjjLab, “Processing and analysis of Allen mouse brain ISH data,” Github, 2024, https://github.com/ZjjLab/Processing-and-analysis-of-Allen-mouse-brain-ISH-data.

Figures (6)

Fig. 1. Research framework. (a) Schematic diagram depicting the process of preparing the coronal and sagittal ISH datasets for the mouse brain provided by AMBA [53]. The coronal and sagittal datasets were obtained by anatomical extraction, sectioning, staining, and quantification of the mouse brain. (b) Data preprocessing. After normalization, the numbers of complete genes in the hippocampal missing mode and in the whole-brain set were counted respectively. (c) Model training strategy. The training dataset was prepared and input into the model for training. (d) Model performance testing and result visualization. The predict and test datasets were used for internal verification with image evaluation metrics. The missing set was externally validated by correlation analysis with the sagittal hippocampal data, matching gene names before and after prediction. Visualization was accomplished through multiple web pages and tools.
Fig. 2. Schematic diagram of network model training. (a) Structure of the 3D Residual U-Net, using an annotation template to store 3D data and the Smooth L1 loss function to measure model robustness. (b) The missing dataset was created using the missing patterns of the fully expressed genes in the hippocampal coronal sections and the absent genes from AMBA. The neural network was trained to predict the missing patterns and fill in the corresponding voxel expression values of the complete genes. The network learned the mapping between input and output, and the absent hippocampal gene data were used for prediction.
Fig. 3. AMBA coronal ISH dataset gene expression distribution and the missing status of ISH slices. (a) The expression distribution of genes in the whole brain, whole-brain set, and hippocampus in the coronal dataset; the heatmap shows the overall distribution of gene deletion, and the bar graph shows the overall distribution of gene expression. (b) The proportion of missing genes in the hippocampus in the coronal dataset (Miss: missing genes; Complete: complete genes). (c) The proportion of missing individual slices according to hippocampal expression in 16 of the 58 coronal sections of the whole brain (Part deletion: partial deletion; Complete deletion: complete deletion).
Fig. 4. Model performance validation. (a) 3D Residual U-Net, 3D U-Net, and iterative KNN applied to the hippocampal coronal slice data and the hippocampal test set; image evaluation metrics were used to analyze model performance. (b) Visualization and metrics of the three models, taking gene Mef2c as an example. (MSE: Mean Squared Error; PSNR: Peak Signal-to-Noise Ratio; SSIM: Structural Similarity Index)
Fig. 5. Sagittal dataset validation. (a) The correlation distribution of coronal and sagittal data matched by gene name before and after prediction by the 3D Residual U-Net model; the Spearman correlation coefficient was used for statistical analysis (Miss: expression in the sagittal data correlated with coronal missing voxels; Raw: before model prediction; Predict: after model prediction). (b) Statistical test of the significance of coronal-sagittal correlations before and after prediction. (c) Visualization example of the left hippocampus before and after coronal data prediction, alongside the sagittal data for the same gene.
Fig. 6. Screening of hippocampal subregion-specific genes. Manual screening of specifically expressed genes after coronal and sagittal visualization of the left hippocampus. The left column is raw coronal ISH data, the middle column is repaired ISH data, and the right column is sagittal ISH data.

Equations (5)


$$x_{new} = \frac{\log_2(x+1) - \log_2(x_{\min}+1)}{\log_2(x_{\max}+1) - \log_2(x_{\min}+1)}$$
$$PSNR = 10 \log_{10}\left(\frac{MAX_I^2}{MSE}\right)$$
$$MSE = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(I_{ij} - K_{ij}\right)^2$$
$$SSIM(I,K) = \frac{(2\mu_I\mu_K + C_1)(2\sigma_{IK} + C_2)}{(\mu_I^2 + \mu_K^2 + C_1)(\sigma_I^2 + \sigma_K^2 + C_2)}$$
$$r_S(I,K) = \frac{\frac{1}{N}\sum_{i=1}^{N}(I_i - \bar{I})(K_i - \bar{K})}{\sqrt{\left(\frac{1}{N}\sum_{i=1}^{N}(I_i - \bar{I})^2\right)\left(\frac{1}{N}\sum_{i=1}^{N}(K_i - \bar{K})^2\right)}}$$
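The evaluation metrics above translate directly into code. The following minimal pure-Python sketch mirrors the MSE, PSNR, and rank-correlation formulas (simplified: the Spearman ranking below ignores ties, and the windowed SSIM computation is omitted for brevity):

```python
import math

def mse(I, K):
    # Mean squared error between two equally sized voxel lists
    return sum((i - k) ** 2 for i, k in zip(I, K)) / len(I)

def psnr(I, K, max_i=1.0):
    # Peak signal-to-noise ratio in dB; higher means a closer match
    return 10 * math.log10(max_i ** 2 / mse(I, K))

def spearman(I, K):
    # Spearman correlation: Pearson correlation of the rank vectors
    # (ties are ignored here for simplicity)
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    ri, rk = ranks(I), ranks(K)
    mi, mk = sum(ri) / len(ri), sum(rk) / len(rk)
    cov = sum((a - mi) * (b - mk) for a, b in zip(ri, rk)) / len(ri)
    si = math.sqrt(sum((a - mi) ** 2 for a in ri) / len(ri))
    sk = math.sqrt(sum((b - mk) ** 2 for b in rk) / len(rk))
    return cov / (si * sk)
```

For instance, two identical voxel vectors give a Spearman correlation of 1.0, and halving the per-voxel error raises the PSNR by about 6 dB, which is how the metrics in Fig. 4 rank the three models.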